Be Kind, Rewind: Checkpoint & Restore Capability for Improving Reliability of Large-Scale Semiconductor Design

Bibliographic details
Published in:	2014 International Conference on Intelligent Networking and Collaborative Systems, pp. 622-627
Main authors:	Ljubuncic, Igor; Rozenfeld, Avikam; Goldis, Andrew; Giri, Ravi
Format:	Conference paper
Language:	English
Publication details:	IEEE, 01.09.2014
Subjects:	Checkpoint & Restore; Checkpointing; Computational modeling; Computer architecture; CPU design; Distributed MultiThreaded Checkpointing; DMTCP; Engineering Computing; Image restoration; Information Technology; Intel; Kernel; Reliability

Description
Summary:	Intel's chip design runs in a large-scale, globally distributed environment with 600,000 cores. In the current semiconductor market, a combination of factors such as time-to-market pressure, explosive growth in the mobile market segment, and upcoming new markets has led to a significant increase in the demand for computing resources and in the reliability required of them. Checkpointing is a capability that can significantly improve reliability; however, there is no mature solution that allows periodic snapshots of running compute jobs to be taken and replayed at a later time in a consistent manner in a large-scale environment. Intel IT has partnered with the Northeastern University (NEU) Distributed Multi-Threaded Checkpointing (DMTCP) team to improve their checkpoint & restore solution for the design computing environment. This paper elaborates on the innovative technological breakthroughs, the industry-academia partnership, and the open-source contribution.
DOI:	10.1109/INCoS.2014.90