MILLION: MasterIng Long-Context LLM Inference Via Outlier-Immunized KV Product QuaNtization

Large language models (LLMs) are increasingly utilized for complex tasks requiring longer context lengths, with some models supporting up to 128K or 1M tokens. This trend, however, presents significant challenges in inference speed and memory management. The primary bottleneck in long-context LLM...

Bibliographic Details
Published in:	2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main Authors:	Wang, Zongwu; Xu, Peng; Liu, Fangxin; Hu, Yiwei; Sun, Qingxiao; Li, Gezi; Li, Cheng; Wang, Xuan; Jiang, Li; Guan, Haibing
Format:	Conference Proceeding
Language:	English
Published:	IEEE, 22 June 2025
Subjects:	Accuracy; Inference algorithms; Kernel; Large language models; Low latency communication; Memory management; Performance gain; Pipelines; Quantization (signal)