Quantifying event correlations for proactive failure management in networked computing systems


Bibliographic details
Published in: Journal of parallel and distributed computing, Vol. 70, No. 11, pp. 1100-1109
Main authors: Fu, Song; Xu, Cheng-Zhong
Format: Journal Article
Language: English
Published: Amsterdam: Elsevier Inc, 1 November 2010
Elsevier
ISSN: 0743-7315, 1096-0848
Description
Summary: Networked computing systems continue to grow in scale and in the complexity of their components and interactions. Component failures become norms instead of exceptions in these environments. Moreover, failure events exhibit strong correlations in the time and space domains. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify the temporal correlation and a stochastic model to characterize spatial correlation. The models are further extended to take into account the information of application allocation to discover more correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. Experimental results on a production coalition system, the Wayne State Computational Grid, show the offline and online predictions made by our predicting system can forecast 72.7–85.3% of the failure occurrences and capture failure correlations in a cluster coalition environment.
► Temporal and spatial failure correlations are modeled and quantified to characterize failure dynamics.
► A prediction mechanism that explores failure correlations is proposed to forecast future failure occurrences.
► High prediction accuracy is achieved in offline and online predictions on a production networked computer system.
DOI: 10.1016/j.jpdc.2010.06.010