Towards Generic Failure-Prediction Models in Large-Scale Distributed Computing Systems
| Title: | Towards Generic Failure-Prediction Models in Large-Scale Distributed Computing Systems |
|---|---|
| Authors: | Jagannathan, Srigoutam; Sharma, Yogesh; Taheri, Javid |
| Source: | Electronics. 14(17) |
| Keywords: | distributed computing, fault detection, machine learning algorithms, prediction algorithms, performance evaluation, Computer Science, Datavetenskap |
| Description: | The increasing complexity of Distributed Computing (DC) systems requires advanced failure-prediction models to enhance reliability and efficiency. This study proposes a comprehensive methodology for developing generic machine learning (ML) models capable of cross-layer and cross-platform failure prediction without requiring platform-specific retraining. Using the Grid5000 failure dataset from the Failure Trace Archive (FTA), we explored Linear and Logistic Regression, Random Forest, and XGBoost to predict three critical metrics: Time Between Failures (TBF), Time to Return/Repair (TTR), and Failing Node Identification (FNI). Our approach involved extensive exploratory data analysis (EDA), statistical examination of failure patterns, and model evaluation at the cluster, site, and system levels. The results demonstrate that XGBoost consistently outperforms the other models, achieving near-perfect (close to 100%) accuracy for TBF and FNI, with robust generalisability across diverse DC environments. In addition, we introduce a hierarchical DC architecture that integrates these failure-prediction models. Through a use case, we also demonstrate how service providers can use these prediction models to balance service reliability and cost. |
| File description: | electronic |
| Access URL: | https://urn.kb.se/resolve?urn=urn:nbn:se:kau:diva-107038 https://doi.org/10.3390/electronics14173386 |
| Database: | SwePub |
| ISSN: | 20799292 |
| DOI: | 10.3390/electronics14173386 |
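
As a rough illustration of two of the metrics named in the description, the sketch below derives TTR and TBF values from a list of failure events. It is not taken from the article: the function name `tbf_and_ttr`, the `(node_id, fail_start, fail_end)` record layout, and the sample events are all hypothetical, and TBF is assumed here to mean the gap between the end of one failure and the start of the next on the same node. The article may define it differently.

```python
from collections import defaultdict


def tbf_and_ttr(events):
    """Compute per-node TTR and TBF lists from failure events.

    events: iterable of (node_id, fail_start, fail_end), times in seconds.
    Assumption: TTR = fail_end - fail_start, and TBF = start of the next
    failure minus end of the previous one, on the same node.
    """
    by_node = defaultdict(list)
    for node, start, end in events:
        by_node[node].append((start, end))

    result = {}
    for node, spans in by_node.items():
        spans.sort()  # order each node's failures by start time
        ttr = [end - start for start, end in spans]
        tbf = [
            next_start - prev_end
            for (_, prev_end), (next_start, _) in zip(spans, spans[1:])
        ]
        result[node] = {"ttr": ttr, "tbf": tbf}
    return result


# Hypothetical sample events, not drawn from the Grid5000 trace.
events = [
    ("node-1", 0, 100),
    ("node-1", 500, 650),
    ("node-2", 200, 260),
    ("node-1", 1000, 1010),
]
metrics = tbf_and_ttr(events)
```

Values like these could then serve as regression targets for the models listed in the description. A node with a single recorded failure yields an empty TBF list.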