AttenPIM: Accelerating LLM Attention with Dual-mode GEMV in Processing-in-Memory

Large Language Models (LLMs) have demonstrated unprecedented generative performance across a wide range of applications. While recent heterogeneous architectures attempt to address the memory-bound bottleneck from attention computations by processing-in-memory (PIM) offloading, they overlook two cri...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	2025 62nd ACM/IEEE Design Automation Conference (DAC) S. 1 - 7
Hauptverfasser:	Chen, Liyan, Lyu, Dongxu, Li, Zhenyu, Jiang, Jianfei, Wang, Qin, Mao, Zhigang, Jing, Naifeng
Format:	Tagungsbericht
Sprache:	Englisch
Veröffentlicht:	IEEE 22.06.2025
Schlagworte:	Computer architecture Concurrent computing Design automation Dynamic scheduling Energy consumption Kernel Large language models Layout Parallel processing Resource management
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	Large Language Models (LLMs) have demonstrated unprecedented generative performance across a wide range of applications. While recent heterogeneous architectures attempt to address the memory-bound bottleneck from attention computations by processing-in-memory (PIM) offloading, they overlook two critical characteristics of attention GEMVs that distinguish them from traditional PIM scenarios: (1) dynamic matrix dimensions that scale with token length, and (2) distinct GEMV patterns between score computation (Q \times K_{t}) and context computation (S \times V). Existing PIM designs, employing either uniform or transposed computing modes, suffer from inefficiencies in newly generated element preparation or distinct GEMV execution. To address these limitations, we propose AttenPIM, a software-hardware co-design for efficient PIM-based attention acceleration. For bank-level execution, we propose dual-mode computing modes tailored for score and context computations with PIM-oriented data layouts and execution flows for KV storage, supported by a low-cost configurable per-bank PIM unit (PU). For system-level execution, we leverage token-level and head-level concurrency to ensure workload balance and maximize bank PU parallelism. Furthermore, dynamic allocation and kernel fusion methods are proposed to further minimize memory overhead. Experimental results demonstrate that AttenPIM achieves 1.13 \times-5.26 \times speedup and reduces energy consumption by 17 %-49 % compared to two state-of-the-art PIM baselines.
DOI:	10.1109/DAC63849.2025.11133230