Towards Generic Failure-Prediction Models in Large-Scale Distributed Computing Systems

Bibliographic details
Title:	Towards Generic Failure-Prediction Models in Large-Scale Distributed Computing Systems
Authors:	Jagannathan, Srigoutam; Sharma, Yogesh; Taheri, Javid
Source:	Electronics. 14(17)
Keywords:	distributed computing, fault detection, machine learning algorithms, prediction algorithms, performance evaluation, Computer Science (Datavetenskap)
Description:	The increasing complexity of Distributed Computing (DC) systems requires advanced failure-prediction models to enhance reliability and efficiency. This study proposes a comprehensive methodology for developing generic machine learning (ML) models capable of cross-layer and cross-platform failure prediction without platform-specific retraining. Using the Grid5000 failure dataset from the Failure Trace Archive (FTA), we explored Linear and Logistic Regression, Random Forest, and XGBoost to predict three critical metrics: Time Between Failures (TBF), Time to Return/Repair (TTR), and Failing Node Identification (FNI). Our approach involved extensive exploratory data analysis (EDA), statistical examination of failure patterns, and model evaluation at the cluster, site, and system levels. The results demonstrate that XGBoost consistently outperforms the other models, achieving near-100% accuracy for TBF and FNI, with robust generalisability across diverse DC environments. In addition, we introduce a hierarchical DC architecture that integrates these failure-prediction models. Through a use case, we also demonstrate how service providers can use these prediction models to balance service reliability and cost.
File description:	electronic
Access URL:	https://urn.kb.se/resolve?urn=urn:nbn:se:kau:diva-107038 https://doi.org/10.3390/electronics14173386
Database:	SwePub

ISSN:	2079-9292
DOI:	10.3390/electronics14173386