RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving

Bibliographic Details
Title: RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
Authors: Wenqi Jiang, Suvinay Subramanian, Cat Graves, Gustavo Alonso, Amir Yazdanbakhsh, Vidushi Dadu
Source: Proceedings of the 52nd Annual International Symposium on Computer Architecture
Publication Status: Preprint
Publisher Information: ACM, 2025.
Publication Year: 2025
Keywords: H.3, FOS: Computer and information sciences, Large Language Model, Computer Science - Computation and Language, Computer Science - Artificial Intelligence, C.4, Performance Optimization, Computer System, Computer Science - Information Retrieval, C.1, Computer Architecture, Retrieval-Augmented Generation, Artificial Intelligence (cs.AI), Computer Science - Distributed, Parallel, and Cluster Computing, Distributed, Parallel, and Cluster Computing (cs.DC), Computation and Language (cs.CL), Information Retrieval (cs.IR)
Description: Retrieval-augmented generation (RAG) is emerging as a popular approach for reliable LLM serving. However, efficient RAG serving remains an open challenge due to the rapid emergence of many RAG variants and the substantial differences in workload characteristics across them. This paper makes three fundamental contributions to advancing RAG serving. First, we introduce RAGSchema, a structured abstraction that captures the wide range of RAG algorithms, serving as a foundation for performance optimization. Second, we analyze several representative RAG workloads with distinct RAGSchema, revealing significant performance variability across these workloads. Third, to address this variability and meet diverse performance requirements, we propose RAGO (Retrieval-Augmented Generation Optimizer), a system optimization framework for efficient RAG serving. RAGO achieves up to a 2× increase in QPS per chip and a 55% reduction in time-to-first-token latency compared to RAG systems built on LLM-system extensions.
ISBN: 979-8-4007-1261-6
ISSN: 2575-713X
Publication Type: Article
Conference object
File Description: application/pdf
DOI: 10.1145/3695053.3731093
DOI: 10.3929/ethz-b-000744758
DOI: 10.48550/arxiv.2503.14649
Access URL: http://arxiv.org/abs/2503.14649
http://hdl.handle.net/20.500.11850/744758
Rights: CC BY
Document Code: edsair.doi.dedup.....ef49ec53941bbd0e9fcea319d5426c5c
Database: OpenAIRE