A data and knowledge-driven practice for ensuring stability in ultra-large intelligent computing clusters

Bibliographic details
Title: A data and knowledge-driven practice for ensuring stability in ultra-large intelligent computing clusters
Authors: NIU Hongweihua, HUANG Yongbao, DING Guoqiang, HUANG Bao, ZHAO Zhiwen, XU Yang, WANG Tao, ZHANG Ruiling, WANG Xuan, ZHANG Yixiang
Source: Dianxin kexue, Vol 41, Pp 145-163 (2025)
Publisher information: Beijing Xintong Media Co., Ltd, 2025.
Publication year: 2025
Collection: LCC:Telecommunication
LCC:Technology
Subjects: intelligent computing cluster, fault diagnosis, SA-BiLSTM, knowledge graph, Telecommunication, TK5101-6720, Technology
Description: A data and knowledge-driven stability assurance scheme was proposed to address frequent hardware failures, persistently high training task failure rates, and difficulties in cross-domain problem localization in ultra-large intelligent computing clusters with more than ten thousand computing cards. Cluster performance data was collected using integrated collection technology for heterogeneous resources and distributed real-time big data ETL techniques. Fault diagnosis was performed with an enhanced SA-BiLSTM deep learning model, and knowledge graph analysis and matching were used to improve the explainability of the model's outputs and to generate fault diagnosis reports. When the deep learning model extracted time series features, weighted fusion was applied to the features extracted at different scales, thereby improving the accuracy of the fault diagnosis model. In fault diagnosis simulation experiments conducted on an 18,000-card cluster, the loss value gradually converged and stabilized at 0.047, and the model achieved an accuracy of 98.4%. Practice has shown that the proposed stability assurance scheme can effectively support large-scale model training and enhance the reliability of intelligent computing clusters, providing a solid foundation for building larger intelligent computing clusters and training large models in the future.
Document type: article
File description: electronic resource
Language: Chinese
ISSN: 1000-0801
Relation: https://doaj.org/toc/1000-0801
DOI: 10.11959/j.issn.1000-0801.2025151
Access URL: https://doaj.org/article/ae0d66fca8de48a4903e3dc9c1938b1b
Accession number: edsdoj.0d66fca8de48a4903e3dc9c1938b1b
Database: Directory of Open Access Journals
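
Note: The abstract mentions weighted fusion of time series features extracted at different scales but gives no formula or architecture. The sketch below illustrates only the general idea, and it is not the article's SA-BiLSTM model. The per-scale "features" here are plain trailing moving averages standing in for learned features, the fusion weights come from a softmax over hand-set scores that a real model would learn, and every name (`moving_average`, `fuse_scales`, `windows`, `scores`) and the sample trace are hypothetical.

```python
import math


def moving_average(series, window):
    """Smooth a series with a trailing moving average of the given window."""
    out = []
    for i in range(len(series)):
        start = max(0, i - window + 1)
        chunk = series[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scales(series, windows, scores):
    """Weighted fusion of per-scale features (illustrative assumption).

    `windows` picks the scales and `scores` sets their relative weight.
    In a trained model the scores would be learned; here they are fixed
    by hand purely for illustration.
    """
    features = [moving_average(series, w) for w in windows]
    weights = softmax(scores)
    fused = [
        sum(w * f[t] for w, f in zip(weights, features))
        for t in range(len(series))
    ]
    return fused, weights


# Hypothetical metric trace for one computing card (made-up values).
trace = [1.0, 1.0, 4.0, 1.0, 1.0]
fused, weights = fuse_scales(trace, windows=[1, 3], scores=[0.0, 0.0])
```

With equal scores the two scales are averaged, so the spike at the third sample is kept from the fine scale while its after-effect is carried by the coarse scale. Raising one score shifts the fused signal toward that scale.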