CONAN: Diagnosing Batch Failures for Cloud Systems

Failure diagnosis is critical to the maintenance of large-scale cloud systems, which has attracted tremendous attention from academia and industry over the last decade. In this paper, we focus on diagnosing batch failures, which occur to a batch of instances of the same subject (e.g., API requests,...

Detailed bibliography
Published in: IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice (Online), pp. 138-149
Main authors: Li, Liqun, Zhang, Xu, He, Shilin, Kang, Yu, Zhang, Hongyu, Ma, Minghua, Dang, Yingnong, Xu, Zhangwei, Rajmohan, Saravan, Lin, Qingwei, Zhang, Dongmei
Format: Conference paper
Language: English
Published: IEEE, 1 May 2023
Subjects:
ISSN:2832-7659
Online access: Get full text
Abstract Failure diagnosis is critical to the maintenance of large-scale cloud systems and has attracted tremendous attention from academia and industry over the last decade. In this paper, we focus on diagnosing batch failures, which affect a batch of instances of the same subject (e.g., API requests, VMs, or nodes) and result in degraded service availability and performance. Manual investigation of a large volume of high-dimensional telemetry data (e.g., logs, traces, and metrics) is labor-intensive and time-consuming, like finding a needle in a haystack. Meanwhile, existing approaches are usually tailored to specific scenarios, which hinders their application in diverse scenarios. In our experience with Azure and Microsoft 365, two world-leading cloud systems, when batch failures happen, the procedure of finding the root cause can be abstracted as looking for contrast patterns by comparing two groups of instances, such as failed vs. succeeded, slow vs. normal, or during vs. before an anomaly. We thus propose CONAN, an efficient and flexible framework that can automatically extract contrast patterns from contextual data. CONAN has been successfully integrated into multiple diagnostic tools for various products, which demonstrates its usefulness in diagnosing real-world batch failures.
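The contrast-pattern idea in the abstract can be illustrated with a small sketch. It is not the search procedure used by CONAN: it simply enumerates attribute-value combinations exhaustively and ranks them by the difference in support between a target group (e.g., failed instances) and a background group (e.g., succeeded instances). The function name `contrast_patterns`, the parameters `max_len` and `min_support`, and the attributes `region`, `version`, and `os` are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def contrast_patterns(target, background, max_len=2, min_support=0.5):
    """Rank attribute-value combinations by support difference.

    Illustrative brute-force scoring only (an assumption of this sketch),
    not the algorithm described in the paper.
    """
    if not target or not background:
        raise ValueError("both groups must be non-empty")

    def count_patterns(group):
        # Count how many instances contain each attribute-value combination.
        counts = Counter()
        for instance in group:
            items = sorted(instance.items())
            for size in range(1, max_len + 1):
                for combo in combinations(items, size):
                    counts[combo] += 1
        return counts

    target_counts = count_patterns(target)
    background_counts = count_patterns(background)

    ranked = []
    for pattern, count in target_counts.items():
        target_support = count / len(target)
        if target_support < min_support:
            continue  # pattern is too rare in the target group
        background_support = background_counts.get(pattern, 0) / len(background)
        ranked.append((dict(pattern), target_support - background_support))

    # Highest contrast first; prefer shorter patterns on ties.
    ranked.sort(key=lambda entry: (-entry[1], len(entry[0])))
    return ranked


# Hypothetical telemetry: failed vs. succeeded API requests.
failed = [
    {"region": "west", "version": "v2", "os": "linux"},
    {"region": "east", "version": "v2", "os": "linux"},
    {"region": "west", "version": "v2", "os": "windows"},
]
succeeded = [
    {"region": "west", "version": "v1", "os": "linux"},
    {"region": "east", "version": "v1", "os": "windows"},
    {"region": "east", "version": "v1", "os": "linux"},
    {"region": "west", "version": "v2", "os": "linux"},
]

ranked = contrast_patterns(failed, succeeded)
print(ranked[0])  # the pattern that best separates failed from succeeded
```

In this toy data, every failed request runs `version = v2` while only one of four succeeded requests does, so that single attribute ranks first. Exhaustive enumeration grows combinatorially with the number of attributes, which is one reason a production system would need a more efficient search strategy.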
Author Zhang, Dongmei
He, Shilin
Lin, Qingwei
Li, Liqun
Kang, Yu
Ma, Minghua
Zhang, Xu
Xu, Zhangwei
Rajmohan, Saravan
Dang, Yingnong
Zhang, Hongyu
Author_xml – sequence: 1
  givenname: Liqun
  surname: Li
  fullname: Li, Liqun
  organization: Microsoft Research
– sequence: 2
  givenname: Xu
  surname: Zhang
  fullname: Zhang, Xu
  organization: Microsoft Research
– sequence: 3
  givenname: Shilin
  surname: He
  fullname: He, Shilin
  organization: Microsoft Research
– sequence: 4
  givenname: Yu
  surname: Kang
  fullname: Kang, Yu
  organization: Microsoft Research
– sequence: 5
  givenname: Hongyu
  surname: Zhang
  fullname: Zhang, Hongyu
  organization: The University of Newcastle
– sequence: 6
  givenname: Minghua
  surname: Ma
  fullname: Ma, Minghua
  organization: Microsoft Research
– sequence: 7
  givenname: Yingnong
  surname: Dang
  fullname: Dang, Yingnong
  organization: Microsoft Azure
– sequence: 8
  givenname: Zhangwei
  surname: Xu
  fullname: Xu, Zhangwei
  organization: Microsoft Azure
– sequence: 9
  givenname: Saravan
  surname: Rajmohan
  fullname: Rajmohan, Saravan
  organization: Microsoft 365
– sequence: 10
  givenname: Qingwei
  surname: Lin
  fullname: Lin, Qingwei
  organization: Microsoft Research
– sequence: 11
  givenname: Dongmei
  surname: Zhang
  fullname: Zhang, Dongmei
  organization: Microsoft Research
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/ICSE-SEIP58684.2023.00018
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9798350300376
EISSN 2832-7659
EndPage 149
ExternalDocumentID 10172587
Genre orig-research
GroupedDBID 6IE
6IL
6IN
AAWTH
ABLEC
ADZIZ
ALMA_UNASSIGNED_HOLDINGS
BEFXN
BFFAM
BGNUA
BKEBE
BPEOZ
CBEJK
CHZPO
IEGSK
OCL
RIE
RIL
ID FETCH-LOGICAL-a296t-58bd2b0214b03965add6f5b9e3bc832f028e5cbd83c0fe5ce045c8b81f6f56e53
IEDL.DBID RIE
ISICitedReferencesCount 8
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001032815500013&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
IngestDate Wed Aug 27 02:20:47 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-a296t-58bd2b0214b03965add6f5b9e3bc832f028e5cbd83c0fe5ce045c8b81f6f56e53
PageCount 12
ParticipantIDs ieee_primary_10172587
PublicationCentury 2000
PublicationDate 2023-May
PublicationDateYYYYMMDD 2023-05-01
PublicationDate_xml – month: 05
  year: 2023
  text: 2023-May
PublicationDecade 2020
PublicationTitle IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice (Online)
PublicationTitleAbbrev ICSE-SEIP
PublicationYear 2023
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssib052587722
ssj0003211720
Score 2.308618
SourceID ieee
SourceType Publisher
StartPage 138
SubjectTerms Cloud Systems
Contrast Pattern
Diagnosis
Incident
Industries
Maintenance engineering
Manuals
Measurement
Metaheuristics
Needles
Software algorithms
Title CONAN: Diagnosing Batch Failures for Cloud Systems
URI https://ieeexplore.ieee.org/document/10172587
WOSCitedRecordID wos001032815500013&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
linkProvider IEEE