CONAN: Diagnosing Batch Failures for Cloud Systems
Failure diagnosis is critical to the maintenance of large-scale cloud systems, which has attracted tremendous attention from academia and industry over the last decade. In this paper, we focus on diagnosing batch failures, which occur to a batch of instances of the same subject (e.g., API requests,...
| Published in: | IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice (Online) pp. 138 - 149 |
|---|---|
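| Main Authors: | Li, Liqun; Zhang, Xu; He, Shilin; Kang, Yu; Zhang, Hongyu; Ma, Minghua; Dang, Yingnong; Xu, Zhangwei; Rajmohan, Saravan; Lin, Qingwei; Zhang, Dongmei |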
| Format: | Conference Proceeding |
| Language: | English |
| Published: | IEEE, 01.05.2023 |
| Subjects: | |
| ISSN: | 2832-7659 |
| Abstract | Failure diagnosis is critical to the maintenance of large-scale cloud systems, which has attracted tremendous attention from academia and industry over the last decade. In this paper, we focus on diagnosing batch failures, which occur to a batch of instances of the same subject (e.g., API requests, VMs, nodes, etc.), resulting in degraded service availability and performance. Manual investigation over a large volume of high-dimensional telemetry data (e.g., logs, traces, and metrics) is labor-intensive and time-consuming, like finding a needle in a haystack. Meanwhile, existing proposed approaches are usually tailored for specific scenarios, which hinders their applications in diverse scenarios. According to our experience with Azure and Microsoft 365 - two world-leading cloud systems, when batch failures happen, the procedure of finding the root cause can be abstracted as looking for contrast patterns by comparing two groups of instances, such as failed vs. succeeded, slow vs. normal, or during vs. before an anomaly. We thus propose CONAN, an efficient and flexible framework that can automatically extract contrast patterns from contextual data. CONAN has been successfully integrated into multiple diagnostic tools for various products, which proves its usefulness in diagnosing real-world batch failures. |
|---|---|
| Author | Zhang, Dongmei; He, Shilin; Lin, Qingwei; Li, Liqun; Kang, Yu; Ma, Minghua; Zhang, Xu; Xu, Zhangwei; Rajmohan, Saravan; Dang, Yingnong; Zhang, Hongyu |
| Author_xml | – sequence: 1 givenname: Liqun surname: Li fullname: Li, Liqun organization: Microsoft Research – sequence: 2 givenname: Xu surname: Zhang fullname: Zhang, Xu organization: Microsoft Research – sequence: 3 givenname: Shilin surname: He fullname: He, Shilin organization: Microsoft Research – sequence: 4 givenname: Yu surname: Kang fullname: Kang, Yu organization: Microsoft Research – sequence: 5 givenname: Hongyu surname: Zhang fullname: Zhang, Hongyu organization: The University of Newcastle – sequence: 6 givenname: Minghua surname: Ma fullname: Ma, Minghua organization: Microsoft Research – sequence: 7 givenname: Yingnong surname: Dang fullname: Dang, Yingnong organization: Microsoft Azure – sequence: 8 givenname: Zhangwei surname: Xu fullname: Xu, Zhangwei organization: Microsoft Azure – sequence: 9 givenname: Saravan surname: Rajmohan fullname: Rajmohan, Saravan organization: Microsoft 365 – sequence: 10 givenname: Qingwei surname: Lin fullname: Lin, Qingwei organization: Microsoft Research – sequence: 11 givenname: Dongmei surname: Zhang fullname: Zhang, Dongmei organization: Microsoft Research |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DBID | 6IE 6IL CBEJK RIE RIL |
| DOI | 10.1109/ICSE-SEIP58684.2023.00018 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Xplore POP ALL IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP All) 1998-Present |
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE/IET Electronic Library (IEL) (UW System Shared) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| EISBN | 9798350300376 |
| EISSN | 2832-7659 |
| EndPage | 149 |
| ExternalDocumentID | 10172587 |
| Genre | orig-research |
| GroupedDBID | 6IE 6IL 6IN AAWTH ABLEC ADZIZ ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK OCL RIE RIL |
| ID | FETCH-LOGICAL-a296t-58bd2b0214b03965add6f5b9e3bc832f028e5cbd83c0fe5ce045c8b81f6f56e53 |
| IEDL.DBID | RIE |
| ISICitedReferencesCount | 8 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001032815500013&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| IngestDate | Wed Aug 27 02:20:47 EDT 2025 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-a296t-58bd2b0214b03965add6f5b9e3bc832f028e5cbd83c0fe5ce045c8b81f6f56e53 |
| PageCount | 12 |
| ParticipantIDs | ieee_primary_10172587 |
| PublicationCentury | 2000 |
| PublicationDate | 2023-May |
| PublicationDateYYYYMMDD | 2023-05-01 |
| PublicationDate_xml | – month: 05 year: 2023 text: 2023-May |
| PublicationDecade | 2020 |
| PublicationTitle | IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice (Online) |
| PublicationTitleAbbrev | ICSE-SEIP |
| PublicationYear | 2023 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SSID | ssib052587722 ssj0003211720 |
| Score | 2.308618 |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 138 |
| SubjectTerms | Cloud Systems Contrast Pattern Diagnosis Incident Industries Maintenance engineering Manuals Measurement Metaheuristics Needles Software algorithms |
| Title | CONAN: Diagnosing Batch Failures for Cloud Systems |
| URI | https://ieeexplore.ieee.org/document/10172587 |
| WOSCitedRecordID | wos001032815500013&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | 1 |
| inHoldings | 1 |
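
The abstract describes root-cause finding as a search for contrast patterns between two groups of instances, such as failed vs. succeeded. The following is a minimal sketch of that idea only, not CONAN's actual algorithm, which this record does not describe. It assumes each instance is a flat dictionary of categorical attributes, scores a pattern by the difference in support between the two groups, and enumerates candidates by brute force. The function names, the scoring rule, and the sample attributes (`region`, `os_version`, `sku`) are all hypothetical.

```python
from itertools import combinations


def support(pattern, instances):
    """Fraction of instances matching every (attribute, value) pair in pattern."""
    if not instances:
        return 0.0
    hits = sum(all(inst.get(k) == v for k, v in pattern) for inst in instances)
    return hits / len(instances)


def contrast_patterns(target, background, max_size=2, top_k=3):
    """Rank attribute-value combinations by support(target) - support(background).

    Illustrative brute-force enumeration; the scoring rule is an assumption.
    """
    # Candidate items are the attribute-value pairs seen in the target group.
    items = sorted({(k, v) for inst in target for k, v in inst.items()})
    scored = []
    for size in range(1, max_size + 1):
        for pattern in combinations(items, size):
            # Skip combinations that assign two values to the same attribute.
            if len({k for k, _ in pattern}) < size:
                continue
            score = support(pattern, target) - support(pattern, background)
            scored.append((score, pattern))
    # Highest score first; on ties, prefer shorter patterns.
    scored.sort(key=lambda sp: (-sp[0], len(sp[1]), sp[1]))
    return scored[:top_k]


# Hypothetical telemetry: every failed instance runs os_version "v2".
failed = [
    {"region": "west", "os_version": "v2", "sku": "A"},
    {"region": "east", "os_version": "v2", "sku": "B"},
    {"region": "west", "os_version": "v2", "sku": "B"},
]
succeeded = [
    {"region": "west", "os_version": "v1", "sku": "A"},
    {"region": "east", "os_version": "v1", "sku": "B"},
    {"region": "east", "os_version": "v1", "sku": "A"},
]

best_score, best_pattern = contrast_patterns(failed, succeeded)[0]
print(best_pattern, best_score)
```

In this toy data the top-ranked pattern is the single attribute-value pair shared by all failed instances and none of the succeeded ones. Exhaustive enumeration grows combinatorially with the number of attributes, so a real system working over high-dimensional telemetry would likely need a more selective search strategy.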