Towards Generic Failure-Prediction Models in Large-Scale Distributed Computing Systems

Bibliographic details
Title: Towards Generic Failure-Prediction Models in Large-Scale Distributed Computing Systems
Authors: Jagannathan, Srigoutam; Sharma, Yogesh; Taheri, Javid
Source: Electronics. 14(17)
Subjects: distributed computing, fault detection, machine learning algorithms, prediction algorithms, performance evaluation, Computer Science, Datavetenskap
Description: The increasing complexity of Distributed Computing (DC) systems requires advanced failure-prediction models to enhance reliability and efficiency. This study proposes a comprehensive methodology for developing generic machine learning (ML) models capable of cross-layer and cross-platform failure prediction without requiring platform-specific retraining. Using the Grid5000 failure dataset from the Failure Trace Archive (FTA), we explored Linear and Logistic Regression, Random Forest, and XGBoost to predict three critical metrics: Time Between Failures (TBF), Time to Return/Repair (TTR), and Failing Node Identification (FNI). Our approach involved extensive exploratory data analysis (EDA), statistical examination of failure patterns, and model evaluation at the cluster, site, and system levels. The results demonstrate that XGBoost consistently outperforms the other models, achieving near-perfect accuracy (close to 100%) for TBF and FNI, with robust generalisability across diverse DC environments. In addition, we introduce a hierarchical DC architecture that integrates these failure-prediction models. Through a use case, we also demonstrate how service providers can use these prediction models to balance service reliability and cost.
File description: electronic
Access URL: https://urn.kb.se/resolve?urn=urn:nbn:se:kau:diva-107038
https://doi.org/10.3390/electronics14173386
Database: SwePub
ISSN: 2079-9292
DOI: 10.3390/electronics14173386
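
Illustrative note: two of the metrics named in the abstract, Time Between Failures (TBF) and Time to Return/Repair (TTR), can be derived from timestamped failure records. The sketch below is only a minimal illustration under stated assumptions and is not the authors' method: the event data are hypothetical, the function names are invented for this example, and TBF is assumed to be measured between consecutive failure start times (other definitions, such as from the end of one repair to the next failure, are also in use).

```python
from datetime import datetime

# Hypothetical failure events for a single node: (failure start, repair end).
# These values are invented for illustration and are not from Grid5000.
events = [
    (datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 2, 0)),
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 11, 0)),
    (datetime(2024, 1, 2, 10, 0), datetime(2024, 1, 2, 10, 30)),
]


def time_to_repair(events):
    """Return the hours from failure start to repair end for each event."""
    return [(end - start).total_seconds() / 3600 for start, end in events]


def time_between_failures(events):
    """Return the hours between consecutive failure starts.

    Assumption: TBF is measured start-to-start; the paper may define it
    differently.
    """
    starts = sorted(start for start, _ in events)
    return [(b - a).total_seconds() / 3600 for a, b in zip(starts, starts[1:])]


if __name__ == "__main__":
    print("TTR (hours):", time_to_repair(events))
    print("TBF (hours):", time_between_failures(events))
```

Series of this kind could then serve as regression targets for models such as those listed in the abstract. Identifying the failing node (FNI) would instead be framed as a classification task.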