From Failure to Insight: Analyzing Disk Breakdowns in Large-Scale HPC Environments

Bibliographic Details
Published in: SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 484-495
Main Authors: George, Anjus; Wang, Meng; Hanley, Jesse; Ransom, Garrett Wilson; Bent, John; Zimmer, Christopher
Format: Conference Proceeding
Language: English
Published: IEEE, 17 November 2024
Abstract Disk failure data provides valuable insights for preventing failures, enhancing storage robustness, guiding system design and deployment, and ensuring reliable operations at data centers. This paper introduces two disk failure datasets collected from large-scale HPC production environments over the past five years, comprising over 5,000 failure records from more than 40,000 disks. We analyzed these datasets across multiple dimensions, including temporal, spatial, and relational trends, and performed a comprehensive reliability assessment. Our analysis yielded numerous observations and insights that influence various operational aspects of HPC storage systems. We believe this study offers a holistic understanding of disk failure trends likely to interest the HPC storage community.
Author Ransom, Garrett Wilson
Zimmer, Christopher
Wang, Meng
Hanley, Jesse
Bent, John
George, Anjus
Author_xml – sequence: 1
  givenname: Anjus
  surname: George
  fullname: George, Anjus
  email: georgea@ornl.gov
  organization: Oak Ridge National Laboratory,National Center for Computational Sciences,Oak Ridge,TN,USA
– sequence: 2
  givenname: Meng
  surname: Wang
  fullname: Wang, Meng
  email: wangm12@uchicago.edu
  organization: University of Chicago,Department of Computer Science,Chicago,IL,USA
– sequence: 3
  givenname: Jesse
  surname: Hanley
  fullname: Hanley, Jesse
  email: hanleyja@ornl.gov
  organization: Oak Ridge National Laboratory,National Center for Computational Sciences,Oak Ridge,TN,USA
– sequence: 4
  givenname: Garrett Wilson
  surname: Ransom
  fullname: Ransom, Garrett Wilson
  email: gransom@lanl.gov
  organization: Los Alamos National Laboratory,HPC Infrastructure,Los Alamos,NM,USA
– sequence: 5
  givenname: John
  surname: Bent
  fullname: Bent, John
  email: johnbent@gmail.com
  organization: Los Alamos National Laboratory,High Performance Computing,Los Alamos,NM,USA
– sequence: 6
  givenname: Christopher
  surname: Zimmer
  fullname: Zimmer, Christopher
  email: zimmercj@ornl.gov
  organization: Oak Ridge National Laboratory,National Center for Computational Sciences,Oak Ridge,TN,USA
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/SCW63240.2024.00070
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9798350355543
EndPage 495
ExternalDocumentID 10820733
Genre orig-research
GrantInformation_xml – fundername: U.S. Department of Energy
  funderid: 10.13039/100000015
GroupedDBID 6IE
6IL
ACM
ALMA_UNASSIGNED_HOLDINGS
CBEJK
RIE
RIL
ID FETCH-LOGICAL-a256t-69948c04e12d9070b1674698ace5401fa918d4fa7c3ba465c45b64e2b489cedd3
IEDL.DBID RIE
ISICitedReferencesCount 0
IngestDate Wed Aug 27 01:59:34 EDT 2025
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-a256t-69948c04e12d9070b1674698ace5401fa918d4fa7c3ba465c45b64e2b489cedd3
PageCount 12
ParticipantIDs ieee_primary_10820733
PublicationCentury 2000
PublicationDate 2024-Nov.-17
PublicationDateYYYYMMDD 2024-11-17
PublicationDate_xml – month: 11
  year: 2024
  text: 2024-Nov.-17
  day: 17
PublicationDecade 2020
PublicationTitle SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis
PublicationTitleAbbrev SC-W
PublicationYear 2024
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssib060584085
Score 1.8893028
SourceID ieee
SourceType Publisher
StartPage 484
SubjectTerms Cause effect analysis
Conferences
Data centers
Electric breakdown
Failure data analysis
Hard disk drives
High performance computing
HPC storage
Production
Reliability
Reliability engineering
Robustness
Summit
Supercomputer
System analysis and design
Title From Failure to Insight: Analyzing Disk Breakdowns in Large-Scale HPC Environments
URI https://ieeexplore.ieee.org/document/10820733