RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
| Title: | RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving |
|---|---|
| Authors: | Wenqi Jiang, Suvinay Subramanian, Cat Graves, Gustavo Alonso, Amir Yazdanbakhsh, Vidushi Dadu |
| Source: | Proceedings of the 52nd Annual International Symposium on Computer Architecture |
| Publication Status: | Preprint |
| Publisher Information: | ACM, 2025. |
| Publication Year: | 2025 |
| Keywords: | H.3, FOS: Computer and information sciences, Large Language Model, Computer Science - Computation and Language, Computer Science - Artificial Intelligence, C.4, Performance Optimization, Computer System, Computer Science - Information Retrieval, C.1, Computer Architecture, Retrieval-Augmented Generation, Artificial Intelligence (cs.AI), Computer Science - Distributed, Parallel, and Cluster Computing, Distributed, Parallel, and Cluster Computing (cs.DC), Computation and Language (cs.CL), Information Retrieval (cs.IR) |
| Description: | Retrieval-augmented generation (RAG) is emerging as a popular approach for reliable LLM serving. However, efficient RAG serving remains an open challenge due to the rapid emergence of many RAG variants and the substantial differences in workload characteristics across them. This paper makes three fundamental contributions to advancing RAG serving. First, we introduce RAGSchema, a structured abstraction that captures the wide range of RAG algorithms, serving as a foundation for performance optimization. Second, we analyze several representative RAG workloads with distinct RAGSchema, revealing significant performance variability across these workloads. Third, to address this variability and meet diverse performance requirements, we propose RAGO (Retrieval-Augmented Generation Optimizer), a system optimization framework for efficient RAG serving. RAGO achieves up to a 2× increase in QPS per chip and a 55% reduction in time-to-first-token latency compared to RAG systems built on LLM-system extensions.<br />Proceedings of the 52nd Annual International Symposium on Computer Architecture<br />ISBN: 979-8-4007-1261-6<br />ISSN: 2575713X |
| Publication Type: | Article, Conference object |
| File Description: | application/pdf |
| DOI: | 10.1145/3695053.3731093 |
| DOI: | 10.3929/ethz-b-000744758 |
| DOI: | 10.48550/arxiv.2503.14649 |
| Access URL: | http://arxiv.org/abs/2503.14649 http://hdl.handle.net/20.500.11850/744758 |
| Rights: | CC BY |
| Document Code: | edsair.doi.dedup.....ef49ec53941bbd0e9fcea319d5426c5c |
| Database: | OpenAIRE |