MemSeer: Leverage Memory Failure Distinctions and Multi-Grained Prediction in Ultra-Scale Heterogeneous X86/ARM Clusters
| Published in: | 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7 |
|---|---|
| Main authors: | , , , , , , , , , , , |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 22 June 2025 |
| Subjects: | |
| Online access: | Get full text |
| Summary: | In high-performance ultra-scale cloud computing, heterogeneous clusters consisting of x86 and ARM architecture platforms have become increasingly common to boost performance and energy efficiency. Ensuring high availability in these environments is crucial for meeting service-level agreements. However, DRAM failures, a primary cause of server downtimes, present significant challenges to reliability, availability, and serviceability. This paper provides an in-depth analysis of memory failure characteristics across cross-architecture platforms in large-scale heterogeneous clusters. We introduce MemSeer, an AIOps-integrated tool that utilizes a multi-grained memory failure prediction approach for x86/ARM heterogeneous clusters. MemSeer improves the F1-score by 17.3% and increases recall by an average of 27% across different lead times compared to state-of-the-art methods. These advancements show great promise in reducing memory failures in cluster environments, decreasing VM interruptions by up to 42.7% and averaging 24.2% in real-world implementations. |
| DOI: | 10.1109/DAC63849.2025.11132417 |