Hierarchical Caching for Agentic Workflows: A Multi-Level Architecture to Reduce Tool Execution Overhead.

Saved in:
Detailed bibliography
Title: Hierarchical Caching for Agentic Workflows: A Multi-Level Architecture to Reduce Tool Execution Overhead.
Authors: Begum, Farhana, Scott, Craig, Nyarko, Kofi, Jeihani, Mansoureh, Khalifa, Fahmi
Source: Machine Learning & Knowledge Extraction; Feb2026, Vol. 8 Issue 2, p30, 35p
Subjects: SOFTWARE architecture, CACHE memory, INTELLIGENT agents
Abstract: Large Language Model (LLM) agents depend heavily on multiple external tools such as APIs, databases, and computational services to perform complex tasks. However, these tool executions create latency and introduce costs, particularly when agents handle similar queries or workflows. Most current caching methods focus on LLM prompt–response pairs or execution plans and overlook redundancies at the tool level. To address this, we designed a multi-level caching architecture that captures redundancy at both the workflow and tool level. The proposed system integrates four key components: (1) hierarchical caching that operates at both the workflow and tool level to capture coarse- and fine-grained redundancies; (2) dependency-aware invalidation using graph-based techniques to maintain consistency when write operations affect cached reads across execution contexts; (3) category-specific time-to-live (TTL) policies tailored to different data types, e.g., weather APIs, user location, database queries, and filesystem and computational tasks; and (4) session isolation to ensure multi-tenant cache safety through automatic session scoping. We evaluated the system using synthetic data with 2.25 million queries across ten configurations in fifteen runs. In addition, we conducted four targeted evaluations—write intensity robustness from 4 to 30% writes, personalized memory effects under isolated vs. shared cache modes, workflow-level caching comparison, and workload sensitivity across five access distributions—on an additional 2.565 million queries, bringing the total experimental scope to 4.815 million executed queries. The architecture achieved 76.5% caching efficiency, reducing query processing time by 13.3× and lowering estimated costs by 73.3% compared to a no-cache baseline. Multi-tenant testing with fifteen concurrent tenants confirmed robust session isolation and 74.1% efficiency under concurrent workloads.
Our evaluation used controlled synthetic workloads following Zipfian distributions, which are commonly used in caching research. While absolute hit rates vary by deployment domain, the architectural principles of hierarchical caching, dependency tracking and session isolation remain broadly applicable. [ABSTRACT FROM AUTHOR]
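The four components named in the abstract (session scoping, category-specific TTLs, and dependency-aware invalidation of cached tool reads) can be illustrated with a minimal sketch. Note this is not the authors' implementation: the class name, the category labels, and the TTL values below are assumptions chosen for illustration only.

```python
import time
from collections import defaultdict

class ToolCache:
    """Illustrative sketch of a session-scoped tool-level cache with
    category-specific TTLs and dependency-aware invalidation.
    Categories and TTL values are hypothetical, not from the paper."""

    # Hypothetical per-category TTLs (seconds), in the spirit of the
    # paper's category-specific TTL policies.
    TTLS = {"weather_api": 300, "user_location": 60,
            "db_query": 30, "filesystem": 10, "compute": 3600}

    def __init__(self):
        self._store = {}               # key -> (expires_at, value)
        self._deps = defaultdict(set)  # resource -> keys that read it

    def _key(self, session_id, tool, args):
        # Automatic session scoping: the session id is part of every
        # key, so tenants can never observe each other's entries.
        return (session_id, tool, tuple(sorted(args.items())))

    def get(self, session_id, tool, args):
        key = self._key(session_id, tool, args)
        entry = self._store.get(key)
        if entry and entry[0] > time.monotonic():
            return entry[1]            # cache hit
        self._store.pop(key, None)     # expired or missing
        return None

    def put(self, session_id, tool, args, value, category, reads=()):
        key = self._key(session_id, tool, args)
        ttl = self.TTLS.get(category, 30)
        self._store[key] = (time.monotonic() + ttl, value)
        for resource in reads:         # record read dependencies
            self._deps[resource].add(key)

    def invalidate(self, resource):
        # Dependency-aware invalidation: a write touching `resource`
        # evicts every cached read that depended on it.
        for key in self._deps.pop(resource, ()):
            self._store.pop(key, None)
```

A cached database read for session "s1" is invisible to session "s2", and a simulated write to the underlying table evicts it:

```python
cache = ToolCache()
cache.put("s1", "db_query", {"table": "users"}, ["alice"],
          category="db_query", reads=["users"])
cache.get("s1", "db_query", {"table": "users"})  # hit
cache.get("s2", "db_query", {"table": "users"})  # miss: isolated
cache.invalidate("users")                        # write to the table
cache.get("s1", "db_query", {"table": "users"})  # miss: invalidated
```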
Copyright of Machine Learning & Knowledge Extraction is the property of MDPI and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Complementary Index
ISSN:25044990
DOI:10.3390/make8020030