PISA: Efficient Precision-Slice Framework for LLMs with Adaptive Numerical Type



Published in: 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main authors: Yang, Ning; Wang, Zongwu; Sun, Qingxiao; Lu, Liqiang; Liu, Fangxin
Format: Conference paper
Language: English
Publication details: IEEE, 22 June 2025
Abstract Large language models (LLMs) have transformed numerous AI applications, with on-device deployment becoming increasingly important for reducing cloud computing costs and protecting user privacy. However, the astronomical model size and limited hardware resources pose significant deployment challenges. Model quantization is a promising approach to mitigating this gap, but the presence of outliers in LLMs reduces its effectiveness. Previous efforts addressed this issue by employing compression-based encoding for mixed-precision quantization. These approaches struggle to balance model accuracy with hardware efficiency due to their value-wise outlier granularity and complex encoding/decoding hardware logic. To address this, we propose PISA (Precision-Slice Framework), an acceleration framework that exploits the massive sparsity in the higher-order part of LLMs by splitting 16-bit values into a 4-bit/12-bit format. Crucially, PISA introduces an early-bird mechanism that leverages the high-order 4-bit computation to predict the importance of the full calculation result. This mechanism enables efficient computational skips by continuing execution only for important computations and using preset values for less significant ones. The scheme can be efficiently integrated with existing hardware accelerators such as systolic arrays without complex encoding/decoding. As a result, PISA outperforms state-of-the-art precision-aware accelerators, achieving a 1.3-4.3× performance boost and 14.3-66.7% greater energy efficiency, with minimal model accuracy loss. This approach enables more efficient on-device LLM deployment, effectively balancing computational efficiency and model accuracy.
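The 4-bit/12-bit precision slicing and the early-bird skip described in the abstract can be sketched in a few lines. This is a minimal illustration of the idea, not the paper's implementation: the importance threshold, the rescaling shift, and the preset fallback value are all hypothetical parameters.

```python
# Illustrative sketch only: split signed 16-bit values into a high-order
# 4-bit slice and a low-order 12-bit slice, then use the cheap high-slice
# product as an "early bird" importance predictor. The threshold and the
# preset value are assumptions, not taken from the paper.

def slice_16bit(x):
    """Split a signed 16-bit integer into a high-order 4-bit slice and a
    low-order 12-bit slice, so that x == high * 4096 + low."""
    high = x >> 12        # arithmetic shift: the sign stays in the high slice
    low = x & 0xFFF       # unsigned low 12 bits, 0 <= low < 4096
    return high, low

def early_bird_dot(a, b, threshold=1 << 20, preset=0):
    """Dot product that first evaluates only the cheap high-order slices.
    If the rescaled high-slice product predicts an unimportant result,
    the full computation is skipped and a preset value is returned."""
    hi_product = sum((x >> 12) * (y >> 12) for x, y in zip(a, b))
    predictor = hi_product << 24             # undo the two 2**-12 scalings
    if abs(predictor) < threshold:
        return preset                        # deemed unimportant: skip
    return sum(x * y for x, y in zip(a, b))  # continue the full calculation

print(early_bird_dot([30000, -12000, 512], [-25000, 8000, 4096]))  # full path
print(early_bird_dot([100, 50], [200, 300]))                       # skipped
```

Because only large-magnitude values populate the high 4-bit slice, most high-slice products are zero, which is the "massive sparsity in the higher-order part" the framework exploits.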
Authors (in sequence, with affiliations):
1. Yang, Ning (Shanghai Jiao Tong University, yn937391832@sjtu.edu.cn)
2. Wang, Zongwu (Shanghai Jiao Tong University)
3. Sun, Qingxiao (China University of Petroleum-Beijing)
4. Lu, Liqiang (Zhejiang University)
5. Liu, Fangxin (Shanghai Jiao Tong University, liufangxin@sjtu.edu.cn)
ContentType Conference Proceeding
DOI 10.1109/DAC63849.2025.11132980
EISBN 9798331503048
EndPage 7
ExternalDocumentID 11132980
Genre orig-research
Funding:
- National Natural Science Foundation of China (funder ID: 10.13039/501100001809)
- Research and Development (funder ID: 10.13039/100006190)
- Natural Science Foundation of Shanghai (funder ID: 10.13039/100007219)
IsPeerReviewed false
IsScholarly true
Language English
PageCount 7
PublicationDate 2025-06-22
PublicationTitle 2025 62nd ACM/IEEE Design Automation Conference (DAC)
PublicationTitleAbbrev DAC
PublicationYear 2025
Publisher IEEE
StartPage 1
SubjectTerms Accuracy
Adaptation models
Computational efficiency
Computational modeling
Logic
Numerical models
Performance gain
Privacy
Quantization (signal)
Systolic arrays
Title PISA: Efficient Precision-Slice Framework for LLMs with Adaptive Numerical Type
URI https://ieeexplore.ieee.org/document/11132980