Quantifying event correlations for proactive failure management in networked computing systems

Networked computing systems continue to grow in scale and in the complexity of their components and interactions. Component failures become norms instead of exceptions in these environments. Moreover, failure events exhibit strong correlations in the time and space domains. In this paper, we develop...

Detailed description

Bibliographic details
Published in: Journal of Parallel and Distributed Computing, Vol. 70, No. 11, pp. 1100-1109
Main authors: Fu, Song; Xu, Cheng-Zhong
Format: Journal Article
Language: English
Published: Amsterdam: Elsevier Inc, 1 November 2010
Elsevier
Subjects:
ISSN: 0743-7315, 1096-0848
Online access: Full text
Abstract Networked computing systems continue to grow in scale and in the complexity of their components and interactions. Component failures become norms instead of exceptions in these environments. Moreover, failure events exhibit strong correlations in the time and space domains. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify the temporal correlation and a stochastic model to characterize spatial correlation. The models are further extended to take into account the information of application allocation to discover more correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. Experimental results on a production coalition system, the Wayne State Computational Grid, show the offline and online predictions made by our predicting system can forecast 72.7–85.3% of the failure occurrences and capture failure correlations in a cluster coalition environment.
► Temporal and spatial failure correlations are modeled and quantified to characterize failure dynamics.
► A prediction mechanism that explores failure correlations is proposed to forecast future failure occurrences.
► High prediction accuracy is achieved in offline and online predictions on a production networked computer system.
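The abstract names a spherical covariance model with an adjustable timescale parameter but does not give its formula. As a rough illustration only, the sketch below uses the standard geostatistical spherical covariance function, which may differ from the authors' exact formulation. The function name, the `sill` parameter, and the sample timestamps are hypothetical and do not come from the paper.

```python
def spherical_covariance(lag, timescale, sill=1.0):
    """Textbook spherical covariance for a time lag.

    Assumption: this is the standard geostatistical form, not necessarily
    the exact model used in the paper. `timescale` plays the role of the
    adjustable range parameter: events further apart than `timescale`
    are treated as uncorrelated.
    """
    if timescale <= 0:
        raise ValueError("timescale must be positive")
    h = abs(lag)
    if h >= timescale:
        return 0.0
    r = h / timescale
    # C(h) = sill * (1 - 1.5 * (h / theta) + 0.5 * (h / theta)^3) for h < theta
    return sill * (1.0 - 1.5 * r + 0.5 * r ** 3)


# Hypothetical failure timestamps (in hours) used only to exercise the function.
failure_times = [0.0, 2.0, 5.0, 30.0]
timescale_hours = 10.0
pairwise = [
    (a, b, spherical_covariance(b - a, timescale_hours))
    for i, a in enumerate(failure_times)
    for b in failure_times[i + 1:]
]
for a, b, c in pairwise:
    print(f"t={a:5.1f} and t={b:5.1f}: covariance {c:.4f}")
```

Under this form, a larger timescale makes more distant failure events count as correlated, which is one plausible reading of "adjustable timescale parameter".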
Author Fu, Song
Xu, Cheng-Zhong
Author_xml – sequence: 1
  givenname: Song
  surname: Fu
  fullname: Fu, Song
  email: songfu@unt.edu, song@nmt.edu
  organization: Department of Computer Science and Engineering, University of North Texas, United States
– sequence: 2
  givenname: Cheng-Zhong
  surname: Xu
  fullname: Xu, Cheng-Zhong
  email: czxu@wayne.edu
  organization: Department of Electrical and Computer Engineering, Wayne State University, United States
BackLink http://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=23293338$$DView record in Pascal Francis
CitedBy_id crossref_primary_10_1016_j_eswa_2014_09_014
crossref_primary_10_1007_s11277_017_4582_8
crossref_primary_10_1016_j_jpdc_2012_09_007
crossref_primary_10_3390_s18061844
crossref_primary_10_1007_s10664_014_9303_2
crossref_primary_10_1016_j_infsof_2019_06_011
crossref_primary_10_1016_j_jnca_2010_07_011
crossref_primary_10_1016_j_jpdc_2012_06_012
ContentType Journal Article
Copyright 2010 Elsevier Inc.
2015 INIST-CNRS
Copyright_xml – notice: 2010 Elsevier Inc.
– notice: 2015 INIST-CNRS
DOI 10.1016/j.jpdc.2010.06.010
DatabaseName CrossRef
Pascal-Francis
Computer and Information Systems Abstracts
Technology Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
Discipline Computer Science
Applied Sciences
EISSN 1096-0848
EndPage 1109
ExternalDocumentID 23293338
10_1016_j_jpdc_2010_06_010
S0743731510001218
ISICitedReferencesCount 28
ISSN 0743-7315
IsPeerReviewed true
IsScholarly true
Issue 11
Keywords Autonomic management
Failure characterization
Networked computing systems
Temporal correlation
System availability
Spatial correlation
Availability
Autonomous system
Production system
Probabilistic approach
Grid
Interconnected power system
Time correlation
Network management
Distributed system
Distributed computing
Modeling
Reactive system
Model matching
Covariance
Coalition
Breakdown
Proactive service
Language English
License CC BY 4.0
PageCount 10
PublicationCentury 2000
PublicationDate 2010-11-01
PublicationDateYYYYMMDD 2010-11-01
PublicationDate_xml – month: 11
  year: 2010
  text: 2010-11-01
  day: 01
PublicationDecade 2010
PublicationPlace Amsterdam
PublicationPlace_xml – name: Amsterdam
PublicationTitle Journal of parallel and distributed computing
PublicationYear 2010
Publisher Elsevier Inc
Elsevier
Publisher_xml – name: Elsevier Inc
– name: Elsevier
StartPage 1100
SubjectTerms Applied sciences
Associations
Autonomic management
Clusters
Computer science; control theory; systems
Computer systems and distributed systems. User interface
Correlation
Dynamical systems
Dynamics
Exact sciences and technology
Failure
Failure characterization
Mathematical models
Networked computing systems
On-line systems
Online
Software
Spatial correlation
System availability
Temporal correlation
Title Quantifying event correlations for proactive failure management in networked computing systems
URI https://dx.doi.org/10.1016/j.jpdc.2010.06.010
https://www.proquest.com/docview/760216527
Volume 70