Checkpointing vs. Supervision Resilience Approaches for Dynamic Independent Tasks
| Published in: | 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 556-565 |
|---|---|
| Main authors: | Jonas Posner, Mia Reitz, Claudia Fohry |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 2021-06-01 |

| Abstract | With the advent of exascale computing, issues such as application irregularity and permanent hardware failure are growing in importance. Irregularity is often addressed by task-based parallel programming coupled with work stealing. At the task level, resilience can be provided by two principal approaches, namely checkpointing and supervision. For both, particular algorithms have been worked out recently. They perform local recovery and continue the program execution on a reduced set of resources. The checkpointing algorithms regularly save task descriptors explicitly, while the supervision algorithms exploit their natural duplication during work stealing and may be coupled with steal tracking to minimize the number of task re-executions. Thus far, the two groups of algorithms have been targeted at different task models: checkpointing algorithms at dynamic independent tasks, and supervision algorithms at nested fork-join programs. This paper transfers the most advanced supervision algorithm to the dynamic independent tasks model, thus enabling a comparison between checkpointing and supervision. Our comparison includes experiments and running time predictions. Results consistently show typical resilience overheads below 1% for both approaches. The overheads are lower for supervision in practically relevant cases, but checkpointing takes over on the order of millions of processes. |
|---|---|
| Author | Fohry, Claudia; Posner, Jonas; Reitz, Mia |
| Author_xml | 1. Posner, Jonas (jonas.posner@uni-kassel.de), University of Kassel, Research Group Programming Languages / Methodologies, Germany; 2. Reitz, Mia (mia.reitz@uni-kassel.de), University of Kassel, Research Group Programming Languages / Methodologies, Germany; 3. Fohry, Claudia (fohry@uni-kassel.de), University of Kassel, Research Group Programming Languages / Methodologies, Germany |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DOI | 10.1109/IPDPSW52791.2021.00089 |
| EISBN | 1665435771; 9781665435772 |
| EndPage | 565 |
| ExternalDocumentID | 9460608 |
| Genre | orig-research |
| ISICitedReferencesCount | 1 |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English |
| PageCount | 10 |
| PublicationCentury | 2000 |
| PublicationDate | 2021-June |
| PublicationDateYYYYMMDD | 2021-06-01 |
| PublicationDate_xml | – month: 06 year: 2021 text: 2021-June |
| PublicationDecade | 2020 |
| PublicationTitle | 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) |
| PublicationTitleAbbrev | IPDPSW |
| PublicationYear | 2021 |
| Publisher | IEEE |
| StartPage | 556 |
| SubjectTerms | Benchmark testing; Checkpointing; Fault Tolerance; Heuristic algorithms; Parallel programming; Prediction algorithms; Resilience; Runtime; Runtime Systems; Target tracking; Task-based Parallel Programming; Work Stealing |
| Title | Checkpointing vs. Supervision Resilience Approaches for Dynamic Independent Tasks |
| URI | https://ieeexplore.ieee.org/document/9460608 |