Quantifying event correlations for proactive failure management in networked computing systems

Networked computing systems continue to grow in scale and in the complexity of their components and interactions. In these environments, component failures become the norm rather than the exception. Moreover, failure events exhibit strong correlations in the time and space domains. In this paper, we develop...

Detailed bibliography
Published in:	Journal of parallel and distributed computing, Vol. 70, No. 11, pp. 1100-1109
Main authors:	Fu, Song; Xu, Cheng-Zhong
Format:	Journal Article
Language:	English
Published:	Amsterdam: Elsevier Inc, 2010-11-01
Subjects:	Applied sciences | Associations | Autonomic management | Clusters | Computer science; control theory; systems | Computer systems and distributed systems. User interface | Correlation | Dynamical systems | Dynamics | Exact sciences and technology | Failure | Failure characterization | Mathematical models | Networked computing systems | On-line systems | Online | Software | Spatial correlation | System availability | Temporal correlation | Availability | Autonomous system | Production system | Probabilistic approach | Grid | Interconnected power system | Time correlation | Network management | Distributed system | Distributed computing | Modeling | Reactive system | Model matching | Covariance | Coalition | Breakdown | Proactive service
ISSN:	0743-7315, 1096-0848

Abstract	Networked computing systems continue to grow in scale and in the complexity of their components and interactions. In these environments, component failures become the norm rather than the exception. Moreover, failure events exhibit strong correlations in the time and space domains. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify the temporal correlation and a stochastic model to characterize spatial correlation. The models are further extended to take application allocation information into account in order to discover more correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. Experimental results on a production coalition system, the Wayne State Computational Grid, show that the offline and online predictions made by our prediction system forecast 72.7–85.3% of the failure occurrences and capture failure correlations in a cluster coalition environment. ► Temporal and spatial failure correlations are modeled and quantified to characterize failure dynamics. ► A prediction mechanism that explores failure correlations is proposed to forecast future failure occurrences. ► High prediction accuracy is achieved in offline and online predictions on a production networked computer system.
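The abstract mentions a spherical covariance model with an adjustable timescale parameter but does not give its formula in this record. As a rough illustration only, the sketch below uses the standard textbook spherical covariance function from geostatistics; the function name `spherical_covariance` and the parameters `timescale` and `sill` are assumptions made for this example, and the paper's actual formulation may differ.

```python
def spherical_covariance(lag, timescale, sill=1.0):
    """Textbook spherical covariance for a time lag between two events.

    Assumption: this is the generic geostatistical form, not necessarily
    the exact model used in the paper. `timescale` is the range beyond
    which two events are treated as uncorrelated; `sill` is the value
    at zero lag.
    """
    if timescale <= 0:
        raise ValueError("timescale must be positive")
    h = abs(lag) / timescale  # normalized lag
    if h >= 1.0:
        return 0.0  # no correlation beyond the timescale
    return sill * (1.0 - 1.5 * h + 0.5 * h ** 3)


# Hypothetical usage: correlation weight of two failures 5 time units apart,
# with a timescale of 10 time units.
weight = spherical_covariance(5, 10)
```

Under this form, the correlation falls smoothly from `sill` at zero lag to zero at `timescale`, so a larger timescale treats failures that are further apart in time as still related.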
Author	Fu, Song; Xu, Cheng-Zhong
Author_xml	1. Fu, Song (songfu@unt.edu, song@nmt.edu), Department of Computer Science and Engineering, University of North Texas, United States; 2. Xu, Cheng-Zhong (czxu@wayne.edu), Department of Electrical and Computer Engineering, Wayne State University, United States
ContentType	Journal Article
Copyright	2010 Elsevier Inc. 2015 INIST-CNRS
DOI	10.1016/j.jpdc.2010.06.010
Discipline	Computer Science Applied Sciences
EISSN	1096-0848
EndPage	1109
ISICitedReferencesCount	28
ISSN	0743-7315
IsPeerReviewed	true
IsScholarly	true
Issue	11
Keywords	Autonomic management | Failure characterization | Networked computing systems | Temporal correlation | System availability | Spatial correlation | Availability | Autonomous system | Production system | Probabilistic approach | Grid | Interconnected power system | Time correlation | Network management | Distributed system | Distributed computing | Modeling | Reactive system | Model matching | Covariance | Coalition | Breakdown | Proactive service
Language	English
License	CC BY 4.0
PageCount	10
PublicationCentury	2000
PublicationDate	2010-11-01
PublicationDecade	2010
PublicationPlace	Amsterdam
PublicationTitle	Journal of parallel and distributed computing
PublicationYear	2010
Publisher	Elsevier Inc Elsevier
StartPage	1100
SubjectTerms	Applied sciences | Associations | Autonomic management | Clusters | Computer science; control theory; systems | Computer systems and distributed systems. User interface | Correlation | Dynamical systems | Dynamics | Exact sciences and technology | Failure | Failure characterization | Mathematical models | Networked computing systems | On-line systems | Online | Software | Spatial correlation | System availability | Temporal correlation
Title	Quantifying event correlations for proactive failure management in networked computing systems
URI	https://dx.doi.org/10.1016/j.jpdc.2010.06.010 https://www.proquest.com/docview/760216527
Volume	70