MemSeer: Leverage Memory Failure Distinctions and Multi-Grained Prediction in Ultra-Scale Heterogeneous X86/ARM Clusters

In high-performance ultra-scale cloud computing, heterogeneous clusters consisting of x86 and ARM architecture platforms have become increasingly common to boost performance and energy efficiency. Ensuring high availability in these environments is crucial for meeting service-level agreements. Howev...

Bibliographic details
Published in: 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main authors: Gu, Yunfei, Liu, Yixuan, Wu, Xinyuan, Shao, Bo, Wu, Chentao, Li, Shiyi, Zhao, Jieru, Li, Jie, Guo, Minyi, Yang, Kunlin, Zhang, Wengui, Lin, Feilong
Format: Conference paper
Language: English
Publication details: IEEE, 22.06.2025
Abstract In high-performance ultra-scale cloud computing, heterogeneous clusters consisting of x86 and ARM architecture platforms have become increasingly common to boost performance and energy efficiency. Ensuring high availability in these environments is crucial for meeting service-level agreements. However, DRAM failures, a primary cause of server downtime, present significant challenges to reliability, availability, and serviceability. This paper provides an in-depth analysis of memory failure characteristics across cross-architecture platforms in large-scale heterogeneous clusters. We introduce MemSeer, an AIOps-integrated tool that utilizes a multi-grained memory failure prediction approach for x86/ARM heterogeneous clusters. MemSeer improves the F1-score by 17.3% and increases recall by an average of 27% across different lead times compared to state-of-the-art methods. These advancements show great promise in reducing memory failures in cluster environments, decreasing VM interruptions by up to 42.7%, and by 24.2% on average, in real-world implementations.
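The abstract reports its gains in terms of F1-score and recall. As a reminder of how these metrics are usually defined for a binary failure predictor, here is a minimal sketch using the standard textbook formulas. The function name `precision_recall_f1` and the counts in the usage lines are hypothetical and are not taken from the paper; the record does not state how MemSeer's evaluation was set up or whether the reported improvements are absolute or relative.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion-matrix counts.

    tp: failures that were predicted and did occur
    fp: failures that were predicted but did not occur
    fn: failures that occurred but were not predicted
    """
    # Guard each ratio against a zero denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only (not results from the paper).
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Recall is the fraction of real failures that the predictor catches, so it is the metric most directly tied to avoided interruptions, while F1 also penalizes false alarms.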
Author Liu, Yixuan
Shao, Bo
Zhang, Wengui
Zhao, Jieru
Li, Jie
Lin, Feilong
Gu, Yunfei
Wu, Chentao
Wu, Xinyuan
Guo, Minyi
Yang, Kunlin
Li, Shiyi
Author_xml – sequence: 1
  givenname: Yunfei
  surname: Gu
  fullname: Gu, Yunfei
  organization: Shanghai Jiao Tong University,Shanghai,China
– sequence: 2
  givenname: Yixuan
  surname: Liu
  fullname: Liu, Yixuan
  organization: Shanghai Jiao Tong University,Shanghai,China
– sequence: 3
  givenname: Xinyuan
  surname: Wu
  fullname: Wu, Xinyuan
  organization: Shanghai Jiao Tong University,Shanghai,China
– sequence: 4
  givenname: Bo
  surname: Shao
  fullname: Shao, Bo
  organization: Shanghai Jiao Tong University,Shanghai,China
– sequence: 5
  givenname: Chentao
  surname: Wu
  fullname: Wu, Chentao
  email: wuct@cs.sjtu.edu.cn
  organization: Shanghai Jiao Tong University,Shanghai,China
– sequence: 6
  givenname: Shiyi
  surname: Li
  fullname: Li, Shiyi
  organization: Harbin Institute of Technology,Shenzhen,China
– sequence: 7
  givenname: Jieru
  surname: Zhao
  fullname: Zhao, Jieru
  organization: Shanghai Jiao Tong University,Shanghai,China
– sequence: 8
  givenname: Jie
  surname: Li
  fullname: Li, Jie
  organization: Shanghai Jiao Tong University,Shanghai,China
– sequence: 9
  givenname: Minyi
  surname: Guo
  fullname: Guo, Minyi
  organization: Shanghai Jiao Tong University,Shanghai,China
– sequence: 10
  givenname: Kunlin
  surname: Yang
  fullname: Yang, Kunlin
  organization: Huawei Technologies Co., Ltd.,China
– sequence: 11
  givenname: Wengui
  surname: Zhang
  fullname: Zhang, Wengui
  organization: Huawei Technologies Co., Ltd.,China
– sequence: 12
  givenname: Feilong
  surname: Lin
  fullname: Lin, Feilong
  organization: Huawei Technologies Co., Ltd.,China
ContentType Conference Proceeding
DBID 6IE
6IH
CBEJK
RIE
RIO
DOI 10.1109/DAC63849.2025.11132417
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9798331503048
EndPage 7
ExternalDocumentID 11132417
Genre orig-research
GroupedDBID 6IE
6IH
CBEJK
RIE
RIO
ID FETCH-LOGICAL-a179t-973cc4d3c84a86661e7a4f27c51a963dc5400f0929a2e620a7763ff37f6d1c573
IEDL.DBID RIE
IngestDate Wed Oct 01 07:05:15 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-a179t-973cc4d3c84a86661e7a4f27c51a963dc5400f0929a2e620a7763ff37f6d1c573
PageCount 7
ParticipantIDs ieee_primary_11132417
PublicationCentury 2000
PublicationDate 2025-June-22
PublicationDateYYYYMMDD 2025-06-22
PublicationDate_xml – month: 06
  year: 2025
  text: 2025-June-22
  day: 22
PublicationDecade 2020
PublicationTitle 2025 62nd ACM/IEEE Design Automation Conference (DAC)
PublicationTitleAbbrev DAC
PublicationYear 2025
Publisher IEEE
Publisher_xml – name: IEEE
Score 2.3021479
Snippet In high-performance ultra-scale cloud computing, heterogeneous clusters consisting of x86 and ARM architecture platforms have become increasingly common to...
SourceID ieee
SourceType Publisher
StartPage 1
SubjectTerms Cloud computing
Computer architecture
Design automation
DRAM Reliability
Energy efficiency
Heterogeneous Clusters
Memory Failures
Prediction
Program processors
Random access memory
Reliability engineering
Servers
Title MemSeer: Leverage Memory Failure Distinctions and Multi-Grained Prediction in Ultra-Scale Heterogeneous X86/ARM Clusters
URI https://ieeexplore.ieee.org/document/11132417
hasFullText 1
inHoldings 1
isFullTextHit
isPrint