A Fault-Model-Relevant Classification of Consensus Mechanisms for MPI and HPC
Large-scale HPC systems experience failures arising from faults in hardware, software, and/or networking. Failure rates continue to grow as systems scale up and out. Crash fault tolerance has up to now been the focus when considering means to augment the Message Passing Interface (MPI) for fault-tol...
| Published in: | International journal of parallel programming, vol. 51, no. 2-3, pp. 128-149 |
|---|---|
| Main authors: | Nansamba, Grace; Altarawneh, Amani; Skjellum, Anthony |
| Format: | Journal Article |
| Language: | English |
| Published: | New York, Springer US, 01.06.2023, Springer Nature B.V |
| Subjects: | |
| ISSN: | 0885-7458, 1573-7640 |
| Abstract | Large-scale HPC systems experience failures arising from faults in hardware, software, and/or networking. Failure rates continue to grow as systems scale up and out. Crash fault tolerance has up to now been the focus when considering means to augment the Message Passing Interface (MPI) for fault-tolerant operations. This narrow model of faults (usually restricted only to process or node failures) is insufficient. Without a more general model for consensus, gaps in the ability to detect, isolate, mitigate, and recover HPC applications efficiently will arise. Focusing on crash failures is insufficient because a chain of underlying components may lead to system failures in MPI. What is more, clusters and leadership-class machines alike often have Reliability, Availability, and Serviceability Systems to convey predictive and real-time fault and error information, which does not map strictly to process and node crashes. A broader study of failures beyond crash failures in MPI, in conjunction with consensus mechanisms, will thus be useful for developers as they continue to design, develop, and implement fault-tolerant HPC systems that reflect observable faults in actual systems. We describe key factors that must be considered during consensus-mechanism design. We illustrate some of the current MPI fault tolerance models based on use cases. We offer a novel classification of common consensus mechanisms based on factors such as the network model and failure types, and on the use cases (e.g., fault detection, synchronization) of consensus in the computation process, including crash fault tolerance as one category. |
|---|---|
| Author | Skjellum, Anthony; Altarawneh, Amani; Nansamba, Grace |
| Author_xml | – sequence: 1 givenname: Grace surname: Nansamba fullname: Nansamba, Grace email: jpp751@mocs.utc.edu organization: University of Tennessee at Chattanooga – sequence: 2 givenname: Amani surname: Altarawneh fullname: Altarawneh, Amani organization: Colorado State University – sequence: 3 givenname: Anthony surname: Skjellum fullname: Skjellum, Anthony organization: University of Tennessee at Chattanooga |
| CitedBy_id | crossref_primary_10_1007_s11227_025_07503_4 |
| Cites_doi | 10.1145/279227.279229 10.1109/TDSC.2009.4 10.1016/j.jpdc.2009.01.001 10.1145/2802658.2802660 10.1007/3-540-48071-4_10 10.1109/ISNCC49221.2020.9297326 10.1109/DASC.2004.1390734 10.1145/167088.167119 10.1007/978-3-319-89884-1_32 10.1145/2043556.2043583 10.1016/j.parco.2019.02.007 10.1109/ICDCS.1989.37933 10.1109/TSP.2016.2537271 10.1145/2455.214112 10.1177/1094342005056139 10.1145/357172.357176 10.1109/IPDPS.2012.113 10.1109/ACSEAC.2012.27 10.1007/3-540-12689-9_99 10.1109/ICDCS.1999.776549 10.1109/IPDPS.2011.367 10.1145/3286978.3287023 10.1109/DSN.2014.78 10.1109/CCGRID.2017.18 10.1177/1094342014522573 10.1177/1094342013488238 10.1145/42282.42283 10.1109/CCWC47524.2020.9031204 10.1145/3293611.3331591 10.1007/BF01798957 10.1561/2200000016 10.1007/3-540-61769-8_3 10.1109/IPDPSW52791.2021.00095 10.1145/347057.347561 10.1145/2063384.2063443 10.1109/SC.2014.63 10.15863/TAS.2017.04.48.5 10.1145/2831129.2831130 10.1145/3458817.3476155 10.1016/j.future.2020.01.026 10.3390/sym11101198 10.1145/571637.571640 10.1145/2954679.2872374 10.1007/978-3-642-24449-0_29 10.1145/2751504.2751511 10.1145/2402.322398 10.1109/DSN.2013.6575356 10.1145/3126908.3126935 10.1109/IPDPS.2015.29 |
| ContentType | Journal Article |
| Copyright | The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law. |
| DOI | 10.1007/s10766-022-00749-y |
| Discipline | Computer Science |
| EISSN | 1573-7640 |
| EndPage | 149 |
| ExternalDocumentID | 10_1007_s10766_022_00749_y |
| GrantInformation_xml | – fundername: National Science Foundation grantid: CCF-1562659; CCF-1562306; CCF-1617690; CCF-1822191; CCF-1821431 funderid: http://dx.doi.org/10.13039/100000001 |
| ISICitedReferencesCount | 1 |
| ISSN | 0885-7458 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 2-3 |
| Keywords | Consensus mechanisms; Fault tolerance; Replication; Fault detection; Synchronization; Message-passing model |
| Language | English |
| PageCount | 22 |
| PublicationCentury | 2000 |
| PublicationDate | 20230600 2023-06-00 20230601 |
| PublicationDateYYYYMMDD | 2023-06-01 |
| PublicationDate_xml | – month: 6 year: 2023 text: 20230600 |
| PublicationDecade | 2020 |
| PublicationPlace | New York |
| PublicationPlace_xml | – name: New York |
| PublicationTitle | International journal of parallel programming |
| PublicationTitleAbbrev | Int J Parallel Prog |
| PublicationYear | 2023 |
| Publisher | Springer US; Springer Nature B.V |
| Publisher_xml | – name: Springer US – name: Springer Nature B.V |
| References_xml | – reference: IsmailLMaterwalaHA review of blockchain architecture and consensus protocols: use cases, challenges, and solutionsSymmetry201910.3390/sym11101198 – reference: BlandWBouteillerAHeraultTBosilcaGDongarraJPost-failure recovery of MPI communication capability: design and rationaleInt. J. High Perform. Comput. Appl.201327324425410.1177/1094342013488238 – reference: García-PérezÁGotsmanAMeshmanYSergeyIAhmedAPaxos consensus, deconstructed and abstractedProgramming Languages and Systems2018ChamSpringer International Publishing91293910.1007/978-3-319-89884-1_321418.68017 – reference: DworkCLynchNStockmeyerLConsensus in the presence of partial synchronyJ. ACM198835228832393525410.1145/42282.42283 – reference: Di, S., Gupta, R., Snir, M., Pershey, E., Cappello, F.: Logaider: A tool for mining potential correlations of hpc log events. In: 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 442–451 (2017). https://doi.org/10.1109/CCGRID.2017.18 – reference: Altarawneh, A., Skjellum, A.: The security ingredients for correct and byzantine fault-tolerant blockchain consensus algorithms. In: 2020 International Symposium on Networks, Computers and Communications (ISNCC), pp. 1–9, (2020). https://doi.org/10.1109/ISNCC49221.2020.9297326 – reference: SchroederBGibsonGAA large-scale study of failures in high-performance computing systemsIEEE Trans. Depend. Secur. Comput.20097433735010.1109/TDSC.2009.4 – reference: Libby, R.: Effective HPC hardware management and failure prediction strategy using IPMI. In: Proceedings of the Linux Symposium. Citeseer, (2003) – reference: Amin, H.: Toward a scalable, transactional, fault-tolerant message passing interface for petascale and exascale machines. PhD dissertation, The University of Alabama at Birmingham (2014) – reference: Borowsky, E., Gafni, E.: Generalized flp impossibility result fort-resilient asynchronous computations. 
In: Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing. STOC ’93, pp. 91–100. Association for Computing Machinery, New York. ISBN 0897915917. (1993). https://doi.org/10.1145/167088.167119 |
| StartPage | 128 |
| SubjectTerms | Classification; Computer Science; Crashes; Failure rates; Fault detection; Fault tolerance; Faults; Message passing; Processor Architectures; Software Engineering/Programming and Operating Systems; Special Issue on High-Level Parallel Programming and Applications (HLPP 2022); Synchronism; System failures; Theory of Computation |
| Title | A Fault-Model-Relevant Classification of Consensus Mechanisms for MPI and HPC |
| URI | https://link.springer.com/article/10.1007/s10766-022-00749-y https://www.proquest.com/docview/2792728053 |
| Volume | 51 |
| Authors | Nansamba, Grace; Altarawneh, Amani; Skjellum, Anthony |