ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching

The Transformer architecture has significantly advanced natural language processing (NLP) and has been foundational in developing large language models (LLMs) such as LLaMA and OPT, which have come to dominate a broad range of NLP tasks. Despite their superior accuracy, LLMs present unique challenge...

Bibliographic details
Published in:	2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 1005-1017
Main authors:	Zhao, Youpeng; Wu, Di; Wang, Jun
Format:	Conference paper
Language:	English
Published:	IEEE, 29 June 2024
Subjects:	Accuracy; Heuristic algorithms; Large language models; Machine Learning Systems; Memory management; Tensors; Throughput; Transformers