Lessons Learned from Memory Errors Observed Over the Lifetime of Cielo

Maintaining the performance of high-performance computing (HPC) applications as failures increase is a major challenge for next-generation extreme-scale systems. Recent work demonstrates that hardware failures are expected to become more common. Few existing studies, however, have examined failures...

Celý popis

Uloženo v:
Podrobná bibliografie
Vydáno v:SC18: International Conference for High Performance Computing, Networking, Storage and Analysis s. 554 - 565
Hlavní autoři: Levy, Scott, Ferreira, Kurt B., DeBardeleben, Nathan, Siddiqua, Taniya, Sridharan, Vilas, Baseman, Elisabeth
Médium: Konferenční příspěvek
Jazyk:angličtina
Vydáno: IEEE 01.11.2018
Témata:
On-line přístup:Získat plný text
Tagy: Přidat tag
Žádné tagy, Buďte první, kdo vytvoří štítek k tomuto záznamu!
Popis
Shrnutí:Maintaining the performance of high-performance computing (HPC) applications as failures increase is a major challenge for next-generation extreme-scale systems. Recent work demonstrates that hardware failures are expected to become more common. Few existing studies, however, have examined failures in the context of the entire lifetime of a single platform. In this paper, we analyze a corpus of empirical failure data collected over the entire five-year lifetime of Cielo, a leadership-class HPC system. Our analysis reveals several important findings about failures on Cielo: (i) its memory (DRAM and SRAM) exhibited no aging effects; detectable, uncorrectable errors (DUE) showed no discernible increase over its five-year lifetime; (ii) contrary to popular belief, correctable DRAM faults are not predictive of future uncorrectable DRAM faults; (iii) the majority of system down events have no identifiable hardware root cause, highlighting the need for more comprehensive logging facilities to improve failure analysis on future systems; and (iv) continued advances will be needed in order for current failure mitigation techniques to be viable on future systems. Our analysis of this corpus of empirical data provides critical analysis of, and guidance for, the deployment of extreme-scale systems.
DOI:10.1109/SC.2018.00046