Checkpointing vs. Supervision Resilience Approaches for Dynamic Independent Tasks

Detailed bibliography
Published in: 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 556-565
Main authors: Posner, Jonas; Reitz, Mia; Fohry, Claudia
Format: Conference paper
Language: English
Publication details: IEEE, 2021-06-01
Abstract With the advent of exascale computing, issues such as application irregularity and permanent hardware failure are growing in importance. Irregularity is often addressed by task-based parallel programming coupled with work stealing. At the task level, resilience can be provided by two principal approaches, namely checkpointing and supervision. For both, particular algorithms have been worked out recently. They perform local recovery and continue the program execution on a reduced set of resources. The checkpointing algorithms regularly save task descriptors explicitly, while the supervision algorithms exploit their natural duplication during work stealing and may be coupled with steal tracking to minimize the number of task re-executions. Thus far, the two groups of algorithms have been targeted at different task models: checkpointing algorithms at dynamic independent tasks, and supervision algorithms at nested fork-join programs. This paper transfers the most advanced supervision algorithm to the dynamic independent tasks model, thus enabling a comparison between checkpointing and supervision. Our comparison includes experiments and running time predictions. Results consistently show typical resilience overheads below 1% for both approaches. The overheads are lower for supervision in practically relevant cases, but checkpointing takes over at scales on the order of millions of processes.
Author Fohry, Claudia
Posner, Jonas
Reitz, Mia
Author_xml – sequence: 1
  givenname: Jonas
  surname: Posner
  fullname: Posner, Jonas
  email: jonas.posner@uni-kassel.de
  organization: University of Kassel, Research Group Programming Languages / Methodologies, Germany
– sequence: 2
  givenname: Mia
  surname: Reitz
  fullname: Reitz, Mia
  email: mia.reitz@uni-kassel.de
  organization: University of Kassel, Research Group Programming Languages / Methodologies, Germany
– sequence: 3
  givenname: Claudia
  surname: Fohry
  fullname: Fohry, Claudia
  email: fohry@uni-kassel.de
  organization: University of Kassel, Research Group Programming Languages / Methodologies, Germany
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/IPDPSW52791.2021.00089
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
EISBN 1665435771
9781665435772
EndPage 565
ExternalDocumentID 9460608
Genre orig-research
ISICitedReferencesCount 1
IsPeerReviewed false
IsScholarly false
Language English
PageCount 10
PublicationCentury 2000
PublicationDate 2021-June
PublicationDateYYYYMMDD 2021-06-01
PublicationDate_xml – month: 06
  year: 2021
  text: 2021-June
PublicationDecade 2020
PublicationTitle 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
PublicationTitleAbbrev IPDPSW
PublicationYear 2021
Publisher IEEE
Publisher_xml – name: IEEE
StartPage 556
SubjectTerms Benchmark testing
Checkpointing
Fault Tolerance
Heuristic algorithms
Parallel programming
Prediction algorithms
Resilience
Runtime
Runtime Systems
Target tracking
Task-based Parallel Programming
Work Stealing
Title Checkpointing vs. Supervision Resilience Approaches for Dynamic Independent Tasks
URI https://ieeexplore.ieee.org/document/9460608