ExaQuery: Proving Data Structure to Unstructured Telemetry Data in Large-Scale HPC
| Title: | ExaQuery: Proving Data Structure to Unstructured Telemetry Data in Large-Scale HPC |
|---|---|
| Authors: | Junaid Ahmed Khan, Martin Molan, Matteo Angelinelli, Andrea Bartolini |
| Source: | ICPE '24: 15th ACM/SPEC International Conference on Performance Engineering; Companion of the 15th ACM/SPEC International Conference on Performance Engineering |
| Publisher information: | ACM, 2024. |
| Publication year: | 2024 |
| Subjects: | High performance computing (HPC), Operational Data Analytics (ODA), Resource Description Framework (RDF), ontology, SPARQL |
| Description: | High-performance computing (HPC) is the cornerstone of technological advancements in our digital age, but its management is becoming increasingly challenging, particularly as systems approach exascale. Operational data analytics (ODA) and holistic monitoring frameworks aim to alleviate this burden by collecting live telemetry from HPC systems. ODA frameworks rely on NoSQL databases for scalability, with implicit data structures embedded in metric names, so navigating the relations in telemetry data requires domain knowledge. To address the need for an explicit representation of relations in telemetry data, we propose a novel ontology for ODA, which we apply to a real HPC installation. The proposed ontology captures relationships between topological components and links hardware components (compute nodes, racks, systems) with the telemetry collected on job execution and allocation. This ontology forms the basis for constructing a knowledge graph, enabling graph queries for ODA. Moreover, we propose a comparative analysis of the complexity (expressed in lines of code) and the domain knowledge requirement (qualitatively assessed by informed end-users) of implementing complex queries with the proposed method and with the NoSQL methods commonly employed in today's ODAs. We focused on six queries informed by facility managers' daily operations, aiming to benefit not only facility managers but also system administrators and user support. Our comparative analysis demonstrates that the proposed ontology facilitates the implementation of complex queries with significantly fewer lines of code and less domain knowledge than NoSQL methods. |
| Document type: | Article, Conference object |
| File description: | application/pdf |
| DOI: | 10.1145/3629527.3652898 |
| Rights: | CC BY |
| Accession number: | edsair.doi.dedup.....c85620ea778c86b23b6f3c8b6c97f837 |
| Database: | OpenAIRE |