PISA: Efficient Precision-Slice Framework for LLMs with Adaptive Numerical Type
Large language models (LLMs) have transformed numerous AI applications, with on-device deployment becoming increasingly important for reducing cloud computing costs and protecting user privacy. However, the astronomical model size and limited hardware resources pose significant deployment challenges...
| Published in: | 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7 |
|---|---|
| Main authors: | Ning Yang, Zongwu Wang, Qingxiao Sun, Liqiang Lu, Fangxin Liu |
| Format: | Conference paper |
| Language: | English |
| Publication details: | IEEE, 22 June 2025 |
| Subject: | |
| Online access: | Get full text |
| Abstract | Large language models (LLMs) have transformed numerous AI applications, with on-device deployment becoming increasingly important for reducing cloud computing costs and protecting user privacy. However, the astronomical model size and limited hardware resources pose significant deployment challenges. Model quantization is a promising approach to mitigate this gap, but the presence of outliers in LLMs reduces its effectiveness. Previous efforts addressed this issue by employing compression-based encoding for mixed-precision quantization. These approaches struggle to balance model accuracy with hardware efficiency due to their value-wise outlier granularity and complex encoding/decoding hardware logic. To address this, we propose PISA (Precision-Slice Framework), an acceleration framework that exploits massive sparsity in the higher-order part of LLMs by splitting 16-bit values into a 4-bit/12-bit format. Crucially, PISA introduces an early-bird mechanism that leverages the high-order 4-bit computation to predict the importance of the full calculation result. This mechanism enables efficient computational skips by continuing execution only for important computations and using preset values for less significant ones. This scheme can be efficiently integrated with existing hardware accelerators like systolic arrays without complex encoding/decoding. As a result, PISA outperforms state-of-the-art precision-aware accelerators, achieving a 1.3-4.3× performance boost and 14.3-66.7% greater energy efficiency, with minimal model accuracy loss. This approach enables more efficient on-device LLM deployment, effectively balancing computational efficiency and model accuracy. |
|---|---|
| Author | Liu, Fangxin; Sun, Qingxiao; Lu, Liqiang; Wang, Zongwu; Yang, Ning |
| Author_xml | 1. Yang, Ning (Shanghai Jiao Tong University; yn937391832@sjtu.edu.cn); 2. Wang, Zongwu (Shanghai Jiao Tong University); 3. Sun, Qingxiao (China University of Petroleum-Beijing); 4. Lu, Liqiang (Zhejiang University); 5. Liu, Fangxin (Shanghai Jiao Tong University; liufangxin@sjtu.edu.cn) |
| ContentType | Conference Proceeding |
| DOI | 10.1109/DAC63849.2025.11132980 |
| EISBN | 9798331503048 |
| EndPage | 7 |
| ExternalDocumentID | 11132980 |
| Genre | orig-research |
| GrantInformation_xml | National Natural Science Foundation of China (funder ID: 10.13039/501100001809); Research and Development (funder ID: 10.13039/100006190); Natural Science Foundation of Shanghai (funder ID: 10.13039/100007219) |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| PageCount | 7 |
| ParticipantIDs | ieee_primary_11132980 |
| PublicationCentury | 2000 |
| PublicationDate | 2025-June-22 |
| PublicationDateYYYYMMDD | 2025-06-22 |
| PublicationDate_xml | – month: 06 year: 2025 text: 2025-June-22 day: 22 |
| PublicationDecade | 2020 |
| PublicationTitle | 2025 62nd ACM/IEEE Design Automation Conference (DAC) |
| PublicationTitleAbbrev | DAC |
| PublicationYear | 2025 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| StartPage | 1 |
| SubjectTerms | Accuracy; Adaptation models; Computational efficiency; Computational modeling; Logic; Numerical models; Performance gain; Privacy; Quantization (signal); Systolic arrays |
| Title | PISA: Efficient Precision-Slice Framework for LLMs with Adaptive Numerical Type |
| URI | https://ieeexplore.ieee.org/document/11132980 |
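
The precision-slice and early-bird ideas summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record gives no details beyond the 4-bit/12-bit split, so the unsigned 16-bit activation format, the function names `split_slices` and `early_bird_dot`, the magnitude threshold, and the preset value of 0 are all assumptions made for illustration.

```python
HIGH_BITS = 4                      # width of the high-order slice
LOW_BITS = 12                      # width of the low-order slice
LOW_MASK = (1 << LOW_BITS) - 1     # 0x0FFF


def split_slices(value):
    """Split an unsigned 16-bit integer into (high 4 bits, low 12 bits).

    Assumption: values are unsigned; the paper's actual number format
    is not stated in this record.
    """
    if not 0 <= value < (1 << (HIGH_BITS + LOW_BITS)):
        raise ValueError("value must fit in 16 unsigned bits")
    return value >> LOW_BITS, value & LOW_MASK


def early_bird_dot(weights, activations, threshold, preset=0):
    """Dot product that first evaluates only the high-order slices.

    Returns (result, skipped). If the high-order partial sum is smaller
    in magnitude than `threshold` (hypothetical importance test), the
    low-order work is skipped and `preset` is returned instead.
    """
    slices = [split_slices(a) for a in activations]

    # Step 1: cheap 4-bit pass, scaled back to its true bit position.
    high_partial = sum(w * hi for w, (hi, _) in zip(weights, slices)) << LOW_BITS

    # Step 2: early-bird decision based on the high-order result only.
    if abs(high_partial) < threshold:
        return preset, True

    # Step 3: important result, so finish with the 12-bit low-order pass.
    low_partial = sum(w * lo for w, (_, lo) in zip(weights, slices))
    return high_partial + low_partial, False


if __name__ == "__main__":
    print(split_slices(0x1234))
    print(early_bird_dot([1, 2], [0x1234, 0x2001], threshold=4096))
    print(early_bird_dot([1, 2], [100, 200], threshold=4096))
```

When the high-order slices are mostly zero, as the abstract says is common in LLMs, most calls take the skip path and never touch the 12-bit slices. In the unskipped case the two partial sums add up to the exact full-precision dot product, because each value equals `(hi << 12) + lo`.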