A Fault-Model-Relevant Classification of Consensus Mechanisms for MPI and HPC

Bibliographic Details
Published in: International journal of parallel programming, Vol. 51, no. 2-3, pp. 128-149
Main Authors: Nansamba, Grace; Altarawneh, Amani; Skjellum, Anthony
Format: Journal Article
Language: English
Published: New York Springer US 01.06.2023
Springer Nature B.V
ISSN: 0885-7458, 1573-7640
Abstract Large-scale HPC systems experience failures arising from faults in hardware, software, and networking, and failure rates continue to grow as systems scale up and out. Until now, efforts to augment the Message Passing Interface (MPI) for fault-tolerant operation have focused on crash fault tolerance. This narrow fault model, usually restricted to process or node failures, is insufficient: without a more general model for consensus, gaps will arise in the ability to detect, isolate, mitigate, and recover from faults in HPC applications efficiently. Focusing on crash failures alone is inadequate because faults anywhere in a chain of underlying components may lead to system failures in MPI. Moreover, clusters and leadership-class machines alike often have Reliability, Availability, and Serviceability (RAS) systems that convey predictive and real-time fault and error information, which does not map strictly to process and node crashes. A broader study of failures beyond crash failures in MPI, in conjunction with consensus mechanisms, will thus be useful to developers as they continue to design, develop, and implement fault-tolerant HPC systems that reflect the faults observable in actual systems. We describe key factors that must be considered during consensus-mechanism design and illustrate some current MPI fault-tolerance models through use cases. We offer a novel classification of common consensus mechanisms based on these factors, such as the network model and failure types, and on the use cases of consensus in the computation process (e.g., fault detection, synchronization), with crash fault tolerance as one category.
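The classification axes named in the abstract (network model, failure types, use cases) can be pictured as a small record type. The following is only an illustrative sketch, not the article's actual taxonomy: the enum members other than crash faults, fault detection, and synchronization are assumptions, and `mechanism_a`, `mechanism_b`, `CATALOG`, and `select` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class NetworkModel(Enum):
    # Assumed set of network models; the abstract names the axis only.
    SYNCHRONOUS = "synchronous"
    PARTIALLY_SYNCHRONOUS = "partially synchronous"
    ASYNCHRONOUS = "asynchronous"


class FailureType(Enum):
    CRASH = "crash"          # named in the abstract
    BYZANTINE = "byzantine"  # assumed example of a non-crash failure type


class UseCase(Enum):
    FAULT_DETECTION = "fault detection"  # named in the abstract
    SYNCHRONIZATION = "synchronization"  # named in the abstract


@dataclass(frozen=True)
class ConsensusMechanism:
    """One row of a hypothetical classification table."""
    name: str
    network_model: NetworkModel
    failure_types: frozenset
    use_cases: frozenset


# Placeholder entries; they do not describe any real protocol.
CATALOG = [
    ConsensusMechanism(
        "mechanism_a",
        NetworkModel.ASYNCHRONOUS,
        frozenset({FailureType.CRASH}),
        frozenset({UseCase.FAULT_DETECTION}),
    ),
    ConsensusMechanism(
        "mechanism_b",
        NetworkModel.PARTIALLY_SYNCHRONOUS,
        frozenset({FailureType.CRASH, FailureType.BYZANTINE}),
        frozenset({UseCase.FAULT_DETECTION, UseCase.SYNCHRONIZATION}),
    ),
]


def select(mechanisms, failure_type, use_case):
    """Return names of mechanisms covering the given failure type and use case."""
    return [
        m.name
        for m in mechanisms
        if failure_type in m.failure_types and use_case in m.use_cases
    ]
```

Calling `select(CATALOG, FailureType.BYZANTINE, UseCase.SYNCHRONIZATION)` filters the placeholder table down to entries that cover both criteria, which is the kind of lookup a fault-model-aware classification is meant to support.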
Author_xml – sequence: 1
  givenname: Grace
  surname: Nansamba
  fullname: Nansamba, Grace
  email: jpp751@mocs.utc.edu
  organization: University of Tennessee at Chattanooga
– sequence: 2
  givenname: Amani
  surname: Altarawneh
  fullname: Altarawneh, Amani
  organization: Colorado State University
– sequence: 3
  givenname: Anthony
  surname: Skjellum
  fullname: Skjellum, Anthony
  organization: University of Tennessee at Chattanooga
CitedBy_id crossref_primary_10_1007_s11227_025_07503_4
ContentType Journal Article
Copyright The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
DOI 10.1007/s10766-022-00749-y
Discipline Computer Science
EISSN 1573-7640
EndPage 149
ExternalDocumentID 10_1007_s10766_022_00749_y
GrantInformation_xml – fundername: National Science Foundation
  grantid: CCF-1562659; CCF-1562306; CCF-1617690; CCF-1822191; CCF-1821431
  funderid: http://dx.doi.org/10.13039/100000001
IsPeerReviewed true
IsScholarly true
Issue 2-3
Keywords Consensus mechanisms; Fault tolerance; Replication; Fault detection; Synchronization; Message-passing model
Language English
PublicationTitle International journal of parallel programming
PublicationTitleAbbrev Int J Parallel Prog
– reference: DworkCNaorMBrickellEFPricing via processing or combatting junk mailAdvances in Cryptology – CRYPTO’ 921993Berlin HeidelbergSpringer13914710.1007/3-540-48071-4_10
– reference: Al-Mamun, A., Zhao, D.: BAASH: enabling blockchain-as-a-service on high-performance computing systems. CoRR Preprint at arxiv: 2001.07022 (2020)
– reference: LamportLThe weak byzantine generals problemJ. ACM198330366867670983910.1145/2402.3223980627.68026
– reference: Hassani, A., Skjellum, A., Brightwell, R.: Design and evaluation of FA-MPI, a transactional resilience scheme for non-blocking MPI. In: 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 750–755, (2014). https://doi.org/10.1109/DSN.2014.78
– reference: Omwenga, M., Otim, J., Lumala, A.: Robust mobile cloud services through offline support, pp. 90–93 (2012). https://doi.org/10.1109/ACSEAC.2012.27
– reference: Katti, A., Di Fatta, G., Naughton, T., Engelmann, C.: Scalable and fault tolerant failure detection and consensus. In: Proceedings of the 22nd European MPI Users’ Group Meeting. Association for Computing Machinery, EuroMPI ’15, New York, (2015) https://doi.org/10.1145/2802658.2802660
– reference: Huang, S.-T.: Detecting termination of distributed computations by external agents. In: [1989] Proceedings. The 9th International Conference on Distributed Computing Systems, pp. 79–84, (1989). https://doi.org/10.1109/ICDCS.1989.37933
– reference: Baudet, M., Ching, A., Chursin, A., Danezis, G., Garillot, F., Li, Z., Malkhi, D., Naor, O., Perelman, D., Sonnino, A.: State machine replication in the libra blockchain (2019)
– reference: Ferreira, K., Stearley, J., Laros, J. H., Oldfield, R., Pedretti, K., Brightwell, R., Riesen, R., Bridges, P. G., Arnold, D.: Evaluating the viability of process replication reliability for exascale systems. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. Association for Computing Machinery, SC ’11, New York (2011a). https://doi.org/10.1145/2063384.2063443
– reference: Buntinas, D.: Scalable distributed consensus to support mpi fault tolerance. In : 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp. 1240–1249 (2012). https://doi.org/10.1109/IPDPS.2012.113
– reference: Bano, S., Sonnino, A., Al-Bassam, M., Azouvi, S., McCorry, P., Meiklejohn, S., Danezis, G.: Sok: Consensus in the age of blockchains. In: Proceedings of the 1st ACM Conference on Advances in Financial Technologies, pp. 183–198 (2019)
– reference: SultanaNRüfenachtMSkjellumALagunaIMohrorKFailure recovery for bulk synchronous applications with MPI stagesParallel Comput.20198411410.1016/j.parco.2019.02.007
– reference: Moise, I.: Efficient agreement protocols in asynchronous distributed systems. In: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, pp. 2022–2025. IEEE, (2011)
– reference: Forum, M.P.I.: MPI: A Message-passing Interface Standard, Version 3.1. (2015). High-Performance Computing Center Stuttgart, University of Stuttgart, (2015). URL https://books.google.com/books?id=Fbv7jwEACAAJ
– reference: StoneJPartridgeCWhen the CRC and TCP checksum disagreeSIGCOMM Comput. Commun. Rev.200030430931910.1145/347057.347561
– reference: Duan, S.: Building reliable and practical byzantine fault tolerance. PhD dissertation, University of California Davis (2016)
– reference: Darius, B.: Scalable distributed consensus to support mpi fault tolerance. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp. 1240–1249. IEEE, (2012)
– reference: Driscoll, K., Hall, B., Paulitsch, M., Zumsteg, P., Sivencrona, H.: The real byzantine generals. In: The 23rd Digital Avionics Systems Conference (IEEE Cat. No.04CH37576), vol. 2, pp. 6.D.4–61 (2004). https://doi.org/10.1109/DASC.2004.1390734
– reference: BoydSParikhNChuEPeleatoBEcksteinJDistributed optimization and statistical learning via the alternating direction method of multipliersFound. Trends Mach. Learn.201131112210.1561/22000000161229.90122
– reference: BrokawTKoziukGThe intelligent platform management interface (IPMI) and enclosure managementElectron. Eng. (Lond.)20007219
– reference: Fan, X., Chai, Q.: Roll-dpos: A randomized delegated proof of stake scheme for scalable blockchain-based internet of things systems. In: Proceedings of the 15th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services. MobiQuitous ’18, pp. 482–484. New York (2018). https://doi.org/10.1145/3286978.3287023
– reference: Leners, J.B., Wu, H., Hung, W.-L., Aguilera, M.K, Walfish, M.: Detecting failures in distributed systems with the falcon spy network, In: Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, pp. 279–294. New York, NY, Association for Computing Machinery (2011). https://doi.org/10.1145/2043556.2043583
– reference: Castro, M., Liskov, B.: Practical byzantine fault tolerance. In: Proceedings of the Third USENIX Symposium on Operating Systems Design and Implementation (OSDI), New Orleans, Louisiana, USA, pp. 173–186, (1999). URL https://dl.acm.org/citation.cfm?id=296824
– reference: Yin, M., Malkhi, D., Reiter, M. K., Gueta, G. G., Abraham, I.: Hotstuff: Bft consensus with linearity and responsiveness. In: Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing, pp. 347–356. ACM (2019)
– reference: Woo, S., Lang, S., Latham, R., Ross, R., Thakur, R.: Reliable MPI-IO through layout-aware replication (2011)
– reference: Cachin, C., Vukolić, M.: Blockchain consensus protocols in the wild. arXiv preprint arXiv:1707.01873, (2017)
– reference: MosesYRaynalMRevisiting simultaneous consensus with crash failuresJ. Parallel Distrib. Comput.200969440040910.1016/j.jpdc.2009.01.001
– reference: HurseyJNaughtonTValleeGGrahamRLCotronisYDanalisANikolopoulosDSDongarraJA log-scaling fault tolerant agreement algorithm for a fault tolerant MPIRecent Advances in the Message Passing Interface2011Berlin HeidelbergSpringer25526310.1007/978-3-642-24449-0_29
– reference: Aguilera, M. K., Toueg, S.: Randomization and failure detection: a hybrid approach to solve consensus. Technical report (1996)
– reference: Bar-NoyADolevDConsensus algorithms with one-bit messagesDistrib. Comput.199143105110109725310.1007/BF017989570723.68012
– reference: SankaranSSquyresJMBarrettBSahayVLumsdaineADuellJHargrovePRomanEThe lam/mpi checkpoint/restart framework: system-initiated checkpointingInt. J. High Perform. Comput. Appl.200519447949310.1177/1094342005056139
– reference: Hassani, A., Skjellum, A., Bangalore, P. V., Brightwell, R.: Practical resilient cases for fa-mpi, a transactional fault-tolerant mpi. In: Proceedings of the 3rd Workshop on Exascale MPI. Association for Computing Machinery, ExaMPI ’15, New York (2015). https://doi.org/10.1145/2831129.2831130
– reference: King, S., Nadal, S.: Ppcoin: Peer-to-peer crypto-currency with proof-of-stake. self-published paper, (2012)
StartPage 128
Subject terms: Classification; Computer Science; Crashes; Failure rates; Fault detection; Fault tolerance; Faults; Message passing; Processor Architectures; Software Engineering/Programming and Operating Systems; Special Issue on High-Level Parallel Programming and Applications (HLPP 2022); Synchronism; System failures; Theory of Computation
Title A Fault-Model-Relevant Classification of Consensus Mechanisms for MPI and HPC
URI https://link.springer.com/article/10.1007/s10766-022-00749-y
https://www.proquest.com/docview/2792728053
Volume 51