ADAMAS: Adaptive Domain-Aware Performance Anomaly Detection in Cloud Service Systems

A common practice in the reliability engineering of cloud services involves the collection of monitoring metrics, followed by comprehensive analysis to identify performance issues. However, existing methods often fall short of detecting diverse and evolving anomalies across different services. More-...

Celý popis

Uloženo v:
Podrobná bibliografie
Vydáno v:Proceedings / International Conference on Software Engineering s. 911 - 923
Hlavní autoři: Gu, Wenwei, Gu, Jiazhen, Liu, Jinyang, Chen, Zhuangbin, Zhang, Jianping, Kuang, Jinxi, Feng, Cong, Yang, Yongqiang, Lyu, Michael R.
Médium: Konferenční příspěvek
Jazyk:angličtina
Vydáno: IEEE 26.04.2025
Témata:
ISSN:1558-1225
On-line přístup:Získat plný text
Tagy: Přidat tag
Žádné tagy, Buďte první, kdo vytvoří štítek k tomuto záznamu!
Popis
Shrnutí:A common practice in the reliability engineering of cloud services involves the collection of monitoring metrics, followed by comprehensive analysis to identify performance issues. However, existing methods often fall short of detecting diverse and evolving anomalies across different services. More-over, there exists a significant gap between the technical and business interpretation of anomalies, i.e., a detected anomaly may not have an actual impact on system performance or user experience. To address these challenges, we propose ADAMAS, an adaptive AutoML-based anomaly detection framework aiming to achieve practical anomaly detection in production cloud systems. To improve the ability to detect cross-service anomalies, we design a novel unsupervised evaluation function to facilitate the automatic searching of the optimal model structure and parameters. ADAMAS also contains a lightweight human-in-the-loop design, which can efficiently incorporate expert knowledge to adapt to the evolving anomaly patterns and bridge the gap between predicted anomalies and actual business exceptions. Fur-thermore, through monitoring the rate of mispredicted anomalies, ADAMAS proactively re-configures the optimal model, forming a continuous loop of system improvement. Extensive evaluation on one public and two industrial datasets shows that ADAMAS outperforms all baseline models with a 0.891 F1-score. The ablation study also proves the effectiveness of the evaluation function design and the incorporation of expert knowledge.
ISSN:1558-1225
DOI:10.1109/ICSE55347.2025.00084