GraphFI: An Efficient Fault Injection Framework for Graph Processing on GPGPUs

Bibliographic Details
Published in: 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main Authors: Jiang, Nan; Yue, Hengshan; Tan, Jingweijia; Zhou, Mengting; Wang, Xiaonan; Wang, Yuchun; Wei, Wenda; Qiu, Meikang; Wei, Xiaohui
Format: Conference paper
Language: English
Publication details: IEEE, 22 June 2025
Description
Summary: As graph tasks become pervasive in real-time and safety-critical domains (e.g., financial fraud detection and electrical power systems), guaranteeing their reliable execution is as essential as pursuing high performance. However, because existing Fault Injection (FI) reliability analysis methods neglect the graph-specific execution paradigm, they typically characterize system error resilience inaccurately, which makes it hard to give useful guidance for designing efficient and reliable graph processing paradigms. This paper proposes GraphFI, an efficient graph fault injection framework for the universal parallel task acceleration platform (i.e., GPGPUs). Our key insight is to progressively uncover the graph-specific error propagation and effect mechanisms, thereby avoiding blind FI trials. First, observing that iterations with similar active vertex sets exhibit similar error behavior, we propose iteration-driven GraphFI (ID-GraphFI), which selects only representative iterations for fast error resilience profiling. Second, by detecting resilience-similarity communities in the graph topology, we propose topology-driven GraphFI (TD-GraphFI), which selects only representative vertices to evaluate each community's overall reliability. Third, by exploiting the graph-specific fault monotonicity property, we propose monotonicity-driven GraphFI (MD-GraphFI), which draws fine-grained boundaries for severe system errors so that predictable or unnecessary fault injections can be skipped. Combining all three, GraphFI reduces the system fault site space by up to two orders of magnitude and achieves a 2.1× to 15.2× speedup over state-of-the-art (SOTA) methods while providing better reliability assessment accuracy.
DOI: 10.1109/DAC63849.2025.11133241
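
The summary's first idea, that iterations with similar active vertex sets behave similarly under faults, can be illustrated with a small sketch. This is not the paper's implementation: the record does not say how similarity is measured, so Jaccard similarity, the greedy grouping, the 0.75 threshold, and the names `jaccard` and `select_representative_iterations` are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List


def jaccard(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Jaccard similarity of two vertex sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_representative_iterations(
    active_sets: List[FrozenSet[int]], threshold: float = 0.75
) -> Dict[int, List[int]]:
    """Greedily group iterations; keys are representative iteration indices.

    Assumed heuristic (not from the paper): an iteration joins the first
    representative whose active vertex set has Jaccard similarity of at
    least `threshold`; otherwise it becomes a new representative.
    """
    groups: Dict[int, List[int]] = {}
    for i, current in enumerate(active_sets):
        for rep in groups:
            if jaccard(active_sets[rep], current) >= threshold:
                groups[rep].append(i)
                break
        else:
            # No existing representative is similar enough.
            groups[i] = [i]
    return groups


# Hypothetical active vertex sets for five iterations of a graph algorithm.
active = [
    frozenset({0, 1, 2, 3}),
    frozenset({0, 1, 2, 3, 4}),
    frozenset({7, 8}),
    frozenset({7, 8}),
    frozenset({0, 1, 2, 3}),
]
groups = select_representative_iterations(active)
print(groups)
```

Under these assumptions, fault injection would run only on the representative iterations (the dictionary keys), and each result would be attributed to every iteration in the same group.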