Optimization of fault tolerance for iterative graph algorithm in spark GraphX based on high performance computing cluster
GraphX is a graph computing library built on Spark, where fault tolerance is necessary to guarantee high availability. However, existing fault tolerance methods are mostly pessimistic and target general computing tasks. Considering the characteristics of...
| Published in: | CCF transactions on high performance computing (Online) Vol. 7; no. 5; pp. 465-477 |
|---|---|
| Main authors: | He, Mengsi; Fu, Zhongming; Tian, Wenlong |
| Format: | Journal Article |
| Language: | English |
| Publication details: | Beijing: Springer Nature B.V, 01.10.2025 |
| Subjects: | |
| ISSN: | 2524-4922, 2524-4930 |
| Online access: | Get full text |
| Abstract | GraphX is a graph computing library built on Spark, where fault tolerance is necessary to guarantee high availability. However, existing fault tolerance methods are mostly pessimistic and target general computing tasks. Considering the characteristics of iterative computation, this paper presents a method that combines optimistic fault tolerance with checkpointing to recover data under different failure conditions. First, for single-node failures, we propose an optimistic fault tolerance mechanism based on a compensation function. It adds no fault tolerance measures in advance and incurs no additional cost when no failure occurs. Second, for multiple-node failures, we propose an automatic checkpoint management strategy based on RDD importance. It jointly considers an RDD's lineage length, dependency relationships, and computation time to select which RDDs to checkpoint. Finally, we implement our proposals in GraphX on Spark 3.5.1 and evaluate their performance with representative iterative graph algorithms on a high performance computing cluster. The results verify the correctness of the iteration results under the mechanism and show that, when recovering RDD partitions, the mechanism and strategy substantially reduce job execution time. |
|---|---|
| Author | Tian, Wenlong Fu, Zhongming He, Mengsi |
| Author_xml | – sequence: 1 givenname: Mengsi surname: He fullname: He, Mengsi – sequence: 2 givenname: Zhongming orcidid: 0000-0003-3041-6990 surname: Fu fullname: Fu, Zhongming – sequence: 3 givenname: Wenlong surname: Tian fullname: Tian, Wenlong |
| ContentType | Journal Article |
| Copyright | China Computer Federation (CCF) 2025. |
| Copyright_xml | – notice: China Computer Federation (CCF) 2025. |
| DBID | AAYXX CITATION JQ2 |
| DOI | 10.1007/s42514-025-00225-2 |
| DatabaseName | CrossRef ProQuest Computer Science Collection |
| DatabaseTitle | CrossRef ProQuest Computer Science Collection |
| DatabaseTitleList | ProQuest Computer Science Collection |
| DeliveryMethod | fulltext_linktorsrc |
| EISSN | 2524-4930 |
| EndPage | 477 |
| ExternalDocumentID | 10_1007_s42514_025_00225_2 |
| IEDL.DBID | RSV |
| ISICitedReferencesCount | 0 |
| ISSN | 2524-4922 |
| IngestDate | Sat Nov 08 14:32:38 EST 2025 Sat Nov 29 07:04:25 EST 2025 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 5 |
| Language | English |
| LinkModel | DirectLink |
| Notes | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14 |
| ORCID | 0000-0003-3041-6990 |
| PQID | 3264157689 |
| PQPubID | 6587180 |
| PageCount | 13 |
| ParticipantIDs | proquest_journals_3264157689 crossref_primary_10_1007_s42514_025_00225_2 |
| PublicationCentury | 2000 |
| PublicationDate | 2025-10-00 20251001 |
| PublicationDateYYYYMMDD | 2025-10-01 |
| PublicationDate_xml | – month: 10 year: 2025 text: 2025-10-00 |
| PublicationDecade | 2020 |
| PublicationPlace | Beijing |
| PublicationPlace_xml | – name: Beijing |
| PublicationTitle | CCF transactions on high performance computing (Online) |
| PublicationYear | 2025 |
| Publisher | Springer Nature B.V |
| Publisher_xml | – name: Springer Nature B.V |
| SSID | ssj0002710226 ssib053822361 |
| Score | 2.3048472 |
| SourceID | proquest crossref |
| SourceType | Aggregation Database Index Database |
| StartPage | 465 |
| SubjectTerms | Algorithms Clusters Computation Data integrity Distributed processing Failure Fault tolerance High performance computing Iterative methods Performance evaluation |
| Title | Optimization of fault tolerance for iterative graph algorithm in spark GraphX based on high performance computing cluster |
| URI | https://www.proquest.com/docview/3264157689 |
| Volume | 7 |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
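
The abstract describes an optimistic fault tolerance mechanism based on a compensation function: nothing is saved in advance, and after a failure the lost state is replaced with a consistent guess from which the iteration can still converge. The record does not give the paper's actual function, so the following is only a minimal, dependency-free sketch of the general idea for PageRank, written in plain Python rather than GraphX. The names `pagerank_step`, `compensate`, `run`, and `GRAPH`, the tiny example graph, and the choice to spread the missing rank mass evenly over the lost vertices are all assumptions made for illustration.

```python
# Illustrative sketch only: a generic compensation function for PageRank.
# This is NOT the paper's implementation; all names and the graph are hypothetical.

# Adjacency list: vertex -> list of out-neighbours (every vertex appears as a key).
GRAPH = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}


def pagerank_step(ranks, out_edges, damping=0.85):
    """One synchronous PageRank iteration over all vertices."""
    n = len(ranks)
    new_ranks = {v: (1.0 - damping) / n for v in ranks}
    for src, targets in out_edges.items():
        if targets:
            share = damping * ranks[src] / len(targets)
            for dst in targets:
                new_ranks[dst] += share
        else:
            # Dangling vertex: spread its rank evenly over all vertices.
            share = damping * ranks[src] / n
            for v in new_ranks:
                new_ranks[v] += share
    return new_ranks


def compensate(ranks, lost):
    """Replace the ranks of lost vertices without reading their old values.

    Assumed compensation rule: the surviving ranks are kept, and the rank
    mass that is missing from the total of 1.0 is split evenly among the
    lost vertices, so the vector is again a valid probability distribution.
    """
    if not lost:
        return dict(ranks)
    surviving_mass = sum(r for v, r in ranks.items() if v not in lost)
    fill = (1.0 - surviving_mass) / len(lost)
    return {v: (fill if v in lost else r) for v, r in ranks.items()}


def run(out_edges, iterations, fail_at=None, lost=frozenset()):
    """Run PageRank; optionally simulate losing some vertices at one iteration."""
    n = len(out_edges)
    ranks = {v: 1.0 / n for v in out_edges}
    for i in range(iterations):
        if i == fail_at:
            # Simulated single-node failure: the ranks of `lost` are gone,
            # so they are rebuilt by the compensation function.
            ranks = compensate(ranks, lost)
        ranks = pagerank_step(ranks, out_edges)
    return ranks


if __name__ == "__main__":
    clean = run(GRAPH, 200)
    recovered = run(GRAPH, 200, fail_at=30, lost={1, 3})
    for v in sorted(clean):
        print(v, clean[v], recovered[v])
```

Because the failure-free path never calls `compensate`, this style of recovery adds no work when nothing fails, which matches the property the abstract claims for the optimistic mechanism. Recovery from simultaneous failures, which the abstract handles with checkpoints chosen by RDD importance, is not covered by this sketch.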