Optimization of fault tolerance for iterative graph algorithm in spark GraphX based on high performance computing cluster

Bibliographic details
Published in: CCF transactions on high performance computing (Online), Vol. 7, No. 5, pp. 465-477
Main authors: He, Mengsi; Fu, Zhongming; Tian, Wenlong
Format: Journal Article
Language: English
Published: Beijing, Springer Nature B.V, 1 October 2025
Subject:
ISSN: 2524-4922, 2524-4930
Abstract GraphX is a graph computing library built on Spark, where fault tolerance is necessary to guarantee high availability. However, existing fault tolerance methods are mostly pessimistic and target general computing tasks. Considering the characteristics of iterative computation, this paper presents a method that combines optimistic fault tolerance with checkpointing to recover data under different failure conditions. First, for single-node failures, we propose an optimistic fault tolerance mechanism based on a compensation function. It adds no fault tolerance measures in advance and incurs no additional cost when no failure occurs. Second, for multiple-node failures, we propose an automatic checkpoint management strategy based on RDD importance. It jointly considers the lineage length, dependency relationships, and computation time of each RDD in order to select suitable RDDs as checkpoints. Finally, we implement our proposals in GraphX of Spark-3.5.1 and evaluate their performance using representative iterative graph algorithms on a high performance computing cluster. The results verify the correctness of the iteration results produced under the mechanism and show that, when RDD partitions must be recovered, the mechanism and the strategy substantially reduce job execution time.
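As an illustration only, the idea of optimistic recovery for an iterative algorithm can be sketched in plain Python, outside Spark. This record does not state the paper's actual compensation function, so `compensate`, `run`, the toy graph, and every other name below are assumptions made for the sketch. The assumed compensation re-initializes lost PageRank values so that all ranks again sum to 1, then lets the remaining iterations converge instead of replaying lineage or restoring a checkpoint.

```python
def pagerank_step(ranks, out_edges, damping=0.85):
    """One synchronous PageRank iteration over a dict-based graph."""
    n = len(ranks)
    nxt = {v: (1.0 - damping) / n for v in ranks}
    for src, dsts in out_edges.items():
        share = damping * ranks[src] / len(dsts)
        for dst in dsts:
            nxt[dst] += share
    return nxt


def compensate(ranks, lost):
    """Hypothetical compensation function (an assumption, not the paper's):
    give each lost vertex an equal share of the missing probability mass,
    so that the ranks sum to 1 again."""
    if not lost:
        return ranks
    surviving_mass = sum(r for v, r in ranks.items() if v not in lost)
    fill = (1.0 - surviving_mass) / len(lost)
    return {v: (fill if v in lost else r) for v, r in ranks.items()}


def run(out_edges, iterations, fail_at=None, lost=()):
    """Iterate PageRank; optionally simulate losing some vertex state."""
    ranks = {v: 1.0 / len(out_edges) for v in out_edges}
    for i in range(iterations):
        if i == fail_at:
            # Simulated failure: the state of the `lost` vertices is gone.
            # Repair it with the compensation function and keep iterating.
            ranks = compensate(ranks, set(lost))
        ranks = pagerank_step(ranks, out_edges)
    return ranks


# Toy graph in which every vertex has at least one outgoing edge.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a", "c"]}

clean = run(graph, 200)
recovered = run(graph, 200, fail_at=5, lost=["b", "c"])
```

In this sketch the failure-free run executes no recovery code at all, which mirrors the abstract's point that the optimistic mechanism incurs no cost when nothing fails. Because damped PageRank converges to the same fixed point from any starting distribution, `recovered` ends up matching `clean` to within floating-point tolerance.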
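The abstract names three factors behind RDD importance (lineage length, dependency relationship, and computation time) but this record gives no formula. The following is therefore a hypothetical sketch: the `RddStats` fields, the equal weights, the normalization by the maximum value, and the threshold are all assumptions chosen only to show how such factors might be combined into one score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RddStats:
    name: str
    lineage_length: int     # ancestors back to the source or last checkpoint
    wide_dependencies: int  # shuffle (wide) dependencies in that lineage
    compute_seconds: float  # measured time to compute this RDD


def importance_scores(candidates, weights=(1.0, 1.0, 1.0)):
    """Assumed scoring rule: weighted sum of the three factors, each
    normalized by its maximum over all candidates."""
    w_len, w_dep, w_time = weights
    max_len = max(c.lineage_length for c in candidates) or 1
    max_dep = max(c.wide_dependencies for c in candidates) or 1
    max_time = max(c.compute_seconds for c in candidates) or 1.0
    return {
        c.name: w_len * c.lineage_length / max_len
        + w_dep * c.wide_dependencies / max_dep
        + w_time * c.compute_seconds / max_time
        for c in candidates
    }


def choose_checkpoints(candidates, threshold=2.0):
    """Return the names of RDDs whose score reaches the assumed threshold."""
    scores = importance_scores(candidates)
    return [name for name, score in scores.items() if score >= threshold]


candidates = [
    RddStats("A", lineage_length=2, wide_dependencies=0, compute_seconds=0.5),
    RddStats("B", lineage_length=10, wide_dependencies=3, compute_seconds=8.0),
    RddStats("C", lineage_length=6, wide_dependencies=1, compute_seconds=2.0),
]
selected = choose_checkpoints(candidates)
```

In a real Spark job, an RDD selected this way would typically be marked with `RDD.checkpoint()` after a directory has been set via `SparkContext.setCheckpointDir`; how the paper automates that step is not described in this record.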
ContentType Journal Article
Copyright China Computer Federation (CCF) 2025.
DOI 10.1007/s42514-025-00225-2
EISSN 2524-4930
EndPage 477
ISSN 2524-4922
IsPeerReviewed true
IsScholarly true
Issue 5
Language English
ORCID 0000-0003-3041-6990
PageCount 13
PublicationDate 2025-10-01
PublicationPlace Beijing
PublicationTitle CCF transactions on high performance computing (Online)
PublicationYear 2025
Publisher Springer Nature B.V
StartPage 465
SubjectTerms Algorithms
Clusters
Computation
Data integrity
Distributed processing
Failure
Fault tolerance
High performance computing
Iterative methods
Performance evaluation
Title Optimization of fault tolerance for iterative graph algorithm in spark GraphX based on high performance computing cluster
URI https://www.proquest.com/docview/3264157689
Volume 7