From Failure to Insight: Analyzing Disk Breakdowns in Large-Scale HPC Environments
| Published in: | SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis pp. 484 - 495 |
|---|---|
| Main Authors: | George, Anjus; Wang, Meng; Hanley, Jesse; Ransom, Garrett Wilson; Bent, John; Zimmer, Christopher |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | IEEE, 17.11.2024 |
| Abstract | Disk failure data provides valuable insights for preventing failures, enhancing storage robustness, guiding system design and deployment, and ensuring reliable operations at data centers. This paper introduces two disk failure datasets collected from large-scale HPC production environments over the past five years, comprising over 5,000 failure records from more than 40,000 disks. We analyzed these datasets across multiple dimensions, including temporal, spatial, and relational trends, and performed a comprehensive reliability assessment. Our analysis yielded numerous observations and insights that influence various operational aspects of HPC storage systems. We believe this study offers a holistic understanding of disk failure trends likely to interest the HPC storage community. |
|---|---|
| Author | Ransom, Garrett Wilson; Zimmer, Christopher; Wang, Meng; Hanley, Jesse; Bent, John; George, Anjus |
| Author_xml | 1. George, Anjus (georgea@ornl.gov), Oak Ridge National Laboratory, National Center for Computational Sciences, Oak Ridge, TN, USA; 2. Wang, Meng (wangm12@uchicago.edu), University of Chicago, Department of Computer Science, Chicago, IL, USA; 3. Hanley, Jesse (hanleyja@ornl.gov), Oak Ridge National Laboratory, National Center for Computational Sciences, Oak Ridge, TN, USA; 4. Ransom, Garrett Wilson (gransom@lanl.gov), Los Alamos National Laboratory, HPC Infrastructure, Los Alamos, NM, USA; 5. Bent, John (johnbent@gmail.com), Los Alamos National Laboratory, High Performance Computing, Los Alamos, NM, USA; 6. Zimmer, Christopher (zimmercj@ornl.gov), Oak Ridge National Laboratory, National Center for Computational Sciences, Oak Ridge, TN, USA |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DOI | 10.1109/SCW63240.2024.00070 |
| EISBN | 9798350355543 |
| EndPage | 495 |
| ExternalDocumentID | 10820733 |
| Genre | orig-research |
| GrantInformation_xml | – fundername: U.S. Department of Energy funderid: 10.13039/100000015 |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English |
| PageCount | 12 |
| PublicationDate | 2024-Nov.-17 |
| PublicationTitle | SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis |
| PublicationTitleAbbrev | SC-W |
| PublicationYear | 2024 |
| Publisher | IEEE |
| StartPage | 484 |
| SubjectTerms | Cause effect analysis; Conferences; Data centers; Electric breakdown; Failure data analysis; Hard disk drives; High performance computing; HPC storage; Production; Reliability; Reliability engineering; Robustness; Summit Supercomputer; System analysis and design |
| Title | From Failure to Insight: Analyzing Disk Breakdowns in Large-Scale HPC Environments |
| URI | https://ieeexplore.ieee.org/document/10820733 |