AttenPIM: Accelerating LLM Attention with Dual-mode GEMV in Processing-in-Memory
| Published in: | 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7 |
|---|---|
| Main authors: | Chen, Liyan; Lyu, Dongxu; Li, Zhenyu; Jiang, Jianfei; Wang, Qin; Mao, Zhigang; Jing, Naifeng |
| Format: | Conference paper |
| Language: | English |
| Publication details: | IEEE, 22 June 2025 |
| Online access: | Get full text |
| Abstract | Large Language Models (LLMs) have demonstrated unprecedented generative performance across a wide range of applications. While recent heterogeneous architectures attempt to address the memory-bound bottleneck from attention computations by processing-in-memory (PIM) offloading, they overlook two critical characteristics of attention GEMVs that distinguish them from traditional PIM scenarios: (1) dynamic matrix dimensions that scale with token length, and (2) distinct GEMV patterns between score computation ($Q \times K_{t}$) and context computation ($S \times V$). Existing PIM designs, employing either uniform or transposed computing modes, suffer from inefficiencies in newly generated element preparation or distinct GEMV execution. To address these limitations, we propose AttenPIM, a software-hardware co-design for efficient PIM-based attention acceleration. For bank-level execution, we propose dual computing modes tailored for score and context computations with PIM-oriented data layouts and execution flows for KV storage, supported by a low-cost configurable per-bank PIM unit (PU). For system-level execution, we leverage token-level and head-level concurrency to ensure workload balance and maximize bank PU parallelism. Furthermore, dynamic allocation and kernel fusion methods are proposed to further minimize memory overhead. Experimental results demonstrate that AttenPIM achieves $1.13\times$-$5.26\times$ speedup and reduces energy consumption by 17%-49% compared to two state-of-the-art PIM baselines. |
|---|---|
| Authors | Chen, Liyan (liyan.chen@sjtu.edu.cn); Lyu, Dongxu (sjtuj@sjtu.edu.cn); Li, Zhenyu; Jiang, Jianfei; Wang, Qin; Mao, Zhigang; Jing, Naifeng (all with Shanghai Jiao Tong University, Department of Micro/Nano Electronics, Shanghai, China) |
| ContentType | Conference Proceeding |
| DOI | 10.1109/DAC63849.2025.11133230 |
| EISBN | 9798331503048 |
| Funding | National Natural Science Foundation of China (funder ID: 10.13039/501100001809) |
| SubjectTerms | Computer architecture; Concurrent computing; Design automation; Dynamic scheduling; Energy consumption; Kernel; Large language models; Layout; Parallel processing; Resource management |
| URI | https://ieeexplore.ieee.org/document/11133230 |
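The two GEMV patterns named in the abstract (score computation $Q \times K_{t}$ and context computation $S \times V$) can be illustrated with a minimal NumPy sketch of single-token decoding for one attention head. This is an assumption-based illustration, not the paper's implementation: the names q, K_cache, V_cache and the sizes d and t are hypothetical.

```python
# Illustrative sketch (not the paper's code): the two attention GEMVs that
# arise when decoding one new token, for a single attention head.
import numpy as np

d, t = 128, 1024                  # head dimension, current token length (hypothetical sizes)
q = np.random.randn(d)            # query of the newly generated token
K_cache = np.random.randn(t, d)   # cached keys: one new row appended per decoded token
V_cache = np.random.randn(t, d)   # cached values: grow the same way

# Score GEMV (Q x K_t): a (1 x d) query against the (t x d) key cache
# produces t scores, so the *output* length grows with the token length.
s = K_cache @ q / np.sqrt(d)      # shape (t,)

# Softmax over the t scores.
w = np.exp(s - s.max())
w /= w.sum()

# Context GEMV (S x V): the (1 x t) score vector against the (t x d) value
# cache produces a fixed d-wide output; here the *reduction* dimension grows.
o = w @ V_cache                   # shape (d,)
print(o.shape)                    # (128,)
```

The sketch shows why the two GEMVs behave differently: the growing token dimension t appears in the output of the score GEMV but in the reduction of the context GEMV, which is the kind of distinction the abstract's dual computing modes and PIM-oriented KV layouts are meant to handle.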