Quantifying event correlations for proactive failure management in networked computing systems
Networked computing systems continue to grow in scale and in the complexity of their components and interactions. Component failures become norms instead of exceptions in these environments. Moreover, failure events exhibit strong correlations in the time and space domains. In this paper, we develop...
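The abstract in this record says temporal correlation is quantified with a spherical covariance model that has an adjustable timescale parameter. The record does not give the paper's exact parameterization, so the sketch below uses the textbook geostatistical spherical covariance instead. It is an illustration under stated assumptions, not the authors' implementation. The names `spherical_covariance`, `pairwise_temporal_correlation`, `timescale`, and `sill` are hypothetical.

```python
def spherical_covariance(lag, timescale, sill=1.0):
    """Textbook spherical covariance (assumed form, not taken from the paper).

    For r = |lag| / timescale:
        C = sill * (1 - 1.5 * r + 0.5 * r**3)   if r < 1
        C = 0                                   otherwise
    """
    if timescale <= 0:
        raise ValueError("timescale must be positive")
    r = abs(lag) / timescale
    if r >= 1.0:
        # Events further apart than the timescale are treated as uncorrelated.
        return 0.0
    return sill * (1.0 - 1.5 * r + 0.5 * r ** 3)


def pairwise_temporal_correlation(timestamps, timescale):
    """Symmetric matrix of spherical covariances between failure timestamps."""
    n = len(timestamps)
    return [
        [spherical_covariance(timestamps[i] - timestamps[j], timescale) for j in range(n)]
        for i in range(n)
    ]
```

A larger `timescale` makes failures that are further apart in time count as correlated, which is one plausible reading of "adjustable timescale parameter"; how the paper tunes it is not stated here.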
| Published in: | Journal of Parallel and Distributed Computing, Volume 70, Issue 11, pp. 1100-1109 |
|---|---|
| Main authors: | Fu, Song; Xu, Cheng-Zhong |
| Format: | Journal Article |
| Language: | English |
| Published: | Amsterdam: Elsevier Inc, 1 November 2010 |
| Subjects: | |
| ISSN: | 0743-7315, 1096-0848 |
| Online access: | Get full text |
| Abstract | Networked computing systems continue to grow in scale and in the complexity of their components and interactions. Component failures become norms instead of exceptions in these environments. Moreover, failure events exhibit strong correlations in the time and space domains. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify the temporal correlation and a stochastic model to characterize spatial correlation. The models are further extended to take into account the information of application allocation to discover more correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. Experimental results on a production coalition system, the Wayne State Computational Grid, show the offline and online predictions made by our predicting system can forecast 72.7–85.3% of the failure occurrences and capture failure correlations in a cluster coalition environment.
► Temporal and spatial failure correlations are modeled and quantified to characterize failure dynamics. ► A prediction mechanism that explores failure correlations is proposed to forecast future failure occurrences. ► High prediction accuracy is achieved in offline and online predictions on a production networked computer system. |
|---|---|
| Author | Fu, Song; Xu, Cheng-Zhong |
| Author_xml | – sequence: 1 givenname: Song surname: Fu fullname: Fu, Song email: songfu@unt.edu, song@nmt.edu organization: Department of Computer Science and Engineering, University of North Texas, United States – sequence: 2 givenname: Cheng-Zhong surname: Xu fullname: Xu, Cheng-Zhong email: czxu@wayne.edu organization: Department of Electrical and Computer Engineering, Wayne State University, United States |
| BackLink | http://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=23293338$$DView record in Pascal Francis |
| CitedBy_id | crossref_primary_10_1016_j_eswa_2014_09_014 crossref_primary_10_1007_s11277_017_4582_8 crossref_primary_10_1016_j_jpdc_2012_09_007 crossref_primary_10_3390_s18061844 crossref_primary_10_1007_s10664_014_9303_2 crossref_primary_10_1016_j_infsof_2019_06_011 crossref_primary_10_1016_j_jnca_2010_07_011 crossref_primary_10_1016_j_jpdc_2012_06_012 |
| Cites_doi | 10.1109/CCGRID.2009.21 10.1109/TR.2002.802886 10.1145/378420.378434 10.1109/SRDS.2007.4365694 10.1145/6420.6422 10.1109/DSN.2007.23 10.1109/MC.2003.1160055 10.1109/TSE.1987.232855 10.1109/SRDS.2006.9 10.1088/1126-6708/2003/03/040 10.1016/j.jpdc.2006.05.006 10.1109/ISCC.2010.5546715 10.1109/IPDPS.2006.1639672 10.1145/956790.956799 10.1145/511361.511362 10.1198/016214501753382282 10.1007/978-3-7091-9198-9_9 10.1016/j.jpdc.2010.01.002 10.1145/1362622.1362678 10.1109/DSN.2004.1311948 10.1147/sj.421.0005 10.1109/SRDS.2006.16 |
| ContentType | Journal Article |
| Copyright | 2010 Elsevier Inc. 2015 INIST-CNRS |
| Copyright_xml | – notice: 2010 Elsevier Inc. – notice: 2015 INIST-CNRS |
| DBID | AAYXX CITATION IQODW 7SC 8FD JQ2 L7M L~C L~D |
| DOI | 10.1016/j.jpdc.2010.06.010 |
| DatabaseName | CrossRef Pascal-Francis Computer and Information Systems Abstracts Technology Research Database ProQuest Computer Science Collection Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts Academic Computer and Information Systems Abstracts Professional |
| DatabaseTitle | CrossRef Computer and Information Systems Abstracts Technology Research Database Computer and Information Systems Abstracts – Academic Advanced Technologies Database with Aerospace ProQuest Computer Science Collection Computer and Information Systems Abstracts Professional |
| DatabaseTitleList | Computer and Information Systems Abstracts |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Computer Science Applied Sciences |
| EISSN | 1096-0848 |
| EndPage | 1109 |
| ExternalDocumentID | 23293338 10_1016_j_jpdc_2010_06_010 S0743731510001218 |
| ISICitedReferencesCount | 28 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000282191700002&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| ISSN | 0743-7315 |
| IngestDate | Sun Nov 09 14:17:57 EST 2025 Mon Jul 21 09:12:14 EDT 2025 Sat Nov 29 07:13:31 EST 2025 Tue Nov 18 22:42:22 EST 2025 Fri Feb 23 02:27:56 EST 2024 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 11 |
| Keywords | Autonomic management, Failure characterization, Networked computing systems, Temporal correlation, System availability, Spatial correlation, Availability, Autonomous system, Production system, Probabilistic approach, Grid, Interconnected power system, Time correlation, Network management, Distributed system, Distributed computing, Modeling, Reactive system, Model matching, Covariance, Coalition, Breakdown, Proactive service |
| Language | English |
| License | CC BY 4.0 |
| LinkModel | OpenURL |
| PQID | 760216527 |
| PQPubID | 23500 |
| PageCount | 10 |
| ParticipantIDs | proquest_miscellaneous_760216527 pascalfrancis_primary_23293338 crossref_primary_10_1016_j_jpdc_2010_06_010 crossref_citationtrail_10_1016_j_jpdc_2010_06_010 elsevier_sciencedirect_doi_10_1016_j_jpdc_2010_06_010 |
| PublicationCentury | 2000 |
| PublicationDate | 2010-11-01 |
| PublicationDateYYYYMMDD | 2010-11-01 |
| PublicationDate_xml | – month: 11 year: 2010 text: 2010-11-01 day: 01 |
| PublicationDecade | 2010 |
| PublicationPlace | Amsterdam |
| PublicationPlace_xml | – name: Amsterdam |
| PublicationTitle | Journal of parallel and distributed computing |
| PublicationYear | 2010 |
| Publisher | Elsevier Inc Elsevier |
| Publisher_xml | – name: Elsevier Inc – name: Elsevier |
| SSID | ssj0011578 |
| Score | 2.1127307 |
| SourceID | proquest pascalfrancis crossref elsevier |
| SourceType | Aggregation Database Index Database Enrichment Source Publisher |
| StartPage | 1100 |
| SubjectTerms | Applied sciences, Associations, Autonomic management, Clusters, Computer science; control theory; systems, Computer systems and distributed systems. User interface, Correlation, Dynamical systems, Dynamics, Exact sciences and technology, Failure, Failure characterization, Mathematical models, Networked computing systems, On-line systems, Online, Software, Spatial correlation, System availability, Temporal correlation |
| Title | Quantifying event correlations for proactive failure management in networked computing systems |
| URI | https://dx.doi.org/10.1016/j.jpdc.2010.06.010 https://www.proquest.com/docview/760216527 |
| Volume | 70 |
| WOSCitedRecordID | wos000282191700002&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| journalDatabaseRights | – providerCode: PRVESC databaseName: Elsevier SD Freedom Collection Journals 2021 customDbUrl: eissn: 1096-0848 dateEnd: 99991231 omitProxy: false ssIdentifier: ssj0011578 issn: 0743-7315 databaseCode: AIEXJ dateStart: 19950101 isFulltext: true titleUrlDefault: https://www.sciencedirect.com providerName: Elsevier |
| linkProvider | Elsevier |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Quantifying+event+correlations+for+proactive+failure+management+in+networked+computing+systems&rft.jtitle=Journal+of+parallel+and+distributed+computing&rft.au=Fu%2C+Song&rft.au=Xu%2C+Cheng-Zhong&rft.date=2010-11-01&rft.issn=0743-7315&rft.volume=70&rft.issue=11&rft.spage=1100&rft.epage=1109&rft_id=info:doi/10.1016%2Fj.jpdc.2010.06.010&rft.externalDBID=NO_FULL_TEXT |