ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching

The Transformer architecture has significantly advanced natural language processing (NLP) and has been foundational in developing large language models (LLMs) such as LLaMA and OPT, which have come to dominate a broad range of NLP tasks. Despite their superior accuracy, LLMs present unique challenge...
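The "unique challenges" the truncated abstract alludes to stem largely from the memory footprint of KV caching during autoregressive decoding, which the paper's sparsity-aware approach targets. As background only (this is a generic illustration, not ALISA's algorithm), the sketch below shows plain KV caching: each decoding step appends the new key/value to a cache so earlier tokens are never recomputed, at the cost of a cache that grows with sequence length. All names and toy dimensions here (attend, K_cache, V_cache, d) are illustrative assumptions, not from the paper.

```python
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector (toy, single head).
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 8                       # head dimension (toy size)
K_cache = np.empty((0, d))  # cached keys, one row per generated token
V_cache = np.empty((0, d))  # cached values

rng = np.random.default_rng(0)
for step in range(5):
    # In a real model, q, k, v come from projecting the newest token's hidden state.
    q, k, v = rng.standard_normal((3, d))
    # Append only the new key/value; all earlier entries are reused, not recomputed.
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)
    print(step, K_cache.shape, out.shape)
```

Note how the cache grows by one row per token; a sparsity-aware scheme in the spirit of the title would keep only the most important rows rather than all of them.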


Bibliographic Details
Published in: 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 1005 - 1017
Main Authors: Zhao, Youpeng; Wu, Di; Wang, Jun
Format: Conference Proceeding
Language: English
Published: IEEE, 29.06.2024