AttenPIM: Accelerating LLM Attention with Dual-mode GEMV in Processing-in-Memory

Bibliographic Details
Published in: 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main Authors: Chen, Liyan, Lyu, Dongxu, Li, Zhenyu, Jiang, Jianfei, Wang, Qin, Mao, Zhigang, Jing, Naifeng
Format: Conference Proceeding
Language: English
Published: IEEE, 22.06.2025
Description
Summary: Large Language Models (LLMs) have demonstrated unprecedented generative performance across a wide range of applications. While recent heterogeneous architectures attempt to address the memory-bound bottleneck from attention computations by processing-in-memory (PIM) offloading, they overlook two critical characteristics of attention GEMVs that distinguish them from traditional PIM scenarios: (1) dynamic matrix dimensions that scale with token length, and (2) distinct GEMV patterns between score computation (Q × Kᵀ) and context computation (S × V). Existing PIM designs, employing either uniform or transposed computing modes, suffer from inefficiencies in newly generated element preparation or distinct GEMV execution. To address these limitations, we propose AttenPIM, a software-hardware co-design for efficient PIM-based attention acceleration. For bank-level execution, we propose dual computing modes tailored for score and context computations, with PIM-oriented data layouts and execution flows for KV storage, supported by a low-cost configurable per-bank PIM unit (PU). For system-level execution, we leverage token-level and head-level concurrency to ensure workload balance and maximize bank PU parallelism. Furthermore, dynamic allocation and kernel fusion methods are proposed to further minimize memory overhead. Experimental results demonstrate that AttenPIM achieves a 1.13×-5.26× speedup and reduces energy consumption by 17%-49% compared to two state-of-the-art PIM baselines.
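For context, the two attention GEMVs the abstract refers to can be sketched as follows. This is a generic illustration of single-token LLM decoding, not code from the paper; the head dimension and context length are hypothetical values chosen for the example.

```python
import numpy as np

# Illustrative sketch (not from AttenPIM itself) of the two attention
# GEMVs during LLM decoding. Dimensions below are hypothetical.
d = 64    # head dimension (fixed)
t = 128   # context length in tokens -- grows by one each decode step

rng = np.random.default_rng(0)
q = rng.standard_normal(d)        # query vector for the new token
K = rng.standard_normal((t, d))   # cached keys, one row per past token
V = rng.standard_normal((t, d))   # cached values, one row per past token

# GEMV 1, score computation (Q x K^T): reduces over d, output length t
scores = K @ q / np.sqrt(d)       # shape (t,)

# Softmax over the t scores
s = np.exp(scores - scores.max())
s /= s.sum()

# GEMV 2, context computation (S x V): reduces over t, output length d
out = s @ V                       # shape (d,)
```

The two GEMVs have distinct shapes: the score GEMV reduces along the fixed head dimension d and produces a length-t output, while the context GEMV reduces along the growing token dimension t and produces a length-d output. A single uniform PIM computing mode therefore cannot serve both patterns efficiently, which is the mismatch the dual-mode design targets.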
DOI:10.1109/DAC63849.2025.11133230