MemSeer: Leverage Memory Failure Distinctions and Multi-Grained Prediction in Ultra-Scale Heterogeneous X86/ARM Clusters
| Published in: | 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7 |
|---|---|
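| Main authors: | Gu, Yunfei; Liu, Yixuan; Wu, Xinyuan; Shao, Bo; Wu, Chentao; Li, Shiyi; Zhao, Jieru; Li, Jie; Guo, Minyi; Yang, Kunlin; Zhang, Wengui; Lin, Feilong |
| Format: | Conference paper |
| Language: | English |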
| Published: | IEEE, 22 June 2025 |
| Subjects: | |
| Online access: | Get full text |
| Abstract | In high-performance ultra-scale cloud computing, heterogeneous clusters consisting of x86 and ARM architecture platforms have become increasingly common to boost performance and energy efficiency. Ensuring high availability in these environments is crucial for meeting service-level agreements. However, DRAM failures, a primary cause of server downtime, present significant challenges to reliability, availability, and serviceability. This paper provides an in-depth analysis of memory failure characteristics across cross-architecture platforms in large-scale heterogeneous clusters. We introduce MemSeer, an AIOps-integrated tool that utilizes a multi-grained memory failure prediction approach for x86/ARM heterogeneous clusters. MemSeer improves the F1-score by 17.3% and increases recall by an average of 27% across different lead times compared to state-of-the-art methods. These advancements show great promise in reducing memory failures in cluster environments, decreasing VM interruptions by up to 42.7%, and by 24.2% on average, in real-world implementations. |
|---|---|
| Author | Liu, Yixuan; Shao, Bo; Zhang, Wengui; Zhao, Jieru; Li, Jie; Lin, Feilong; Gu, Yunfei; Wu, Chentao; Wu, Xinyuan; Guo, Minyi; Yang, Kunlin; Li, Shiyi |
| Author_xml | 1. Gu, Yunfei (Shanghai Jiao Tong University, Shanghai, China); 2. Liu, Yixuan (Shanghai Jiao Tong University, Shanghai, China); 3. Wu, Xinyuan (Shanghai Jiao Tong University, Shanghai, China); 4. Shao, Bo (Shanghai Jiao Tong University, Shanghai, China); 5. Wu, Chentao (Shanghai Jiao Tong University, Shanghai, China; email: wuct@cs.sjtu.edu.cn); 6. Li, Shiyi (Harbin Institute of Technology, Shenzhen, China); 7. Zhao, Jieru (Shanghai Jiao Tong University, Shanghai, China); 8. Li, Jie (Shanghai Jiao Tong University, Shanghai, China); 9. Guo, Minyi (Shanghai Jiao Tong University, Shanghai, China); 10. Yang, Kunlin (Huawei Technologies Co., Ltd., China); 11. Zhang, Wengui (Huawei Technologies Co., Ltd., China); 12. Lin, Feilong (Huawei Technologies Co., Ltd., China) |
| ContentType | Conference Proceeding |
| DOI | 10.1109/DAC63849.2025.11132417 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Proceedings Order Plan (POP) 1998-present by volume; IEEE Xplore All Conference Proceedings; IEEE/IET Electronic Library (IEL) (UW System Shared); IEEE Proceedings Order Plans (POP) 1998-present |
| EISBN | 9798331503048 |
| EndPage | 7 |
| ExternalDocumentID | 11132417 |
| Genre | orig-research |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| PageCount | 7 |
| PublicationCentury | 2000 |
| PublicationDate | 2025-June-22 |
| PublicationDateYYYYMMDD | 2025-06-22 |
| PublicationDate_xml | – month: 06 year: 2025 text: 2025-June-22 day: 22 |
| PublicationDecade | 2020 |
| PublicationTitle | 2025 62nd ACM/IEEE Design Automation Conference (DAC) |
| PublicationTitleAbbrev | DAC |
| PublicationYear | 2025 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 1 |
| SubjectTerms | Cloud computing; Computer architecture; Design automation; DRAM Reliability; Energy efficiency; Heterogeneous Clusters; Memory Failures Prediction; Program processors; Random access memory; Reliability engineering; Servers |
| Title | MemSeer: Leverage Memory Failure Distinctions and Multi-Grained Prediction in Ultra-Scale Heterogeneous X86/ARM Clusters |
| URI | https://ieeexplore.ieee.org/document/11132417 |