There and back again: Outlier detection between statistical reasoning and data mining algorithms

Bibliographic Details
Published in: Wiley interdisciplinary reviews. Data mining and knowledge discovery, Vol. 8, No. 6, p. e1280 - n/a
Main Authors: Zimek, Arthur; Filzmoser, Peter
Format: Journal Article
Language: English
Published: Hoboken, USA: Wiley Periodicals, Inc, 1 November 2018
Wiley Subscription Services, Inc
Subjects:
ISSN: 1942-4787, 1942-4795
Online Access: Full text
Description
Summary: Outlier detection has been a topic in statistics for centuries. Over mainly the last two decades, there has also been an increasing interest in the database and data mining community in developing scalable methods for outlier detection. Although initially based on statistical reasoning, these methods soon lost the direct probabilistic interpretability of the derived outlier scores. Here, we detail, from a joint point of view of data mining and statistics, the roots and the path of development of statistical outlier detection and of database-related data mining methods for outlier detection. We discuss their inherent meaning, review approaches to again find a statistically meaningful interpretation of outlier scores, and sketch related current research topics.

This article is categorized under: Algorithmic Development > Statistics; Algorithmic Development > Scalable Statistical Methods; Technologies > Machine Learning.

Masking and swamping: A distribution model (green density contours) computed for the inliers (green points) reveals the outlier (red point) as far off. If, however, the outlier is taken into account when fitting the distribution model to the data (red density contours), the outlier itself might be well covered by the model (it is masked), while some inlier might now appear to be too far off (the lower right inlier is swamped).
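
The masking effect described in the figure caption can be illustrated with a minimal sketch. It is not taken from the article: the toy data, the choice of a Gaussian model scored by Mahalanobis distance, and all names (`mahalanobis`, `inliers`, `outlier`) are assumptions made for illustration only. The sketch fits the model once to the inliers alone and once to all points, then compares the distance assigned to the outlier.

```python
import numpy as np


def mahalanobis(points, mean, cov):
    """Mahalanobis distance of each row of `points` to the model (mean, cov)."""
    diff = points - mean
    inv_cov = np.linalg.inv(cov)
    # Row-wise quadratic form diff_i^T * inv_cov * diff_i
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))


# Hypothetical toy data: 20 inliers close to the line y = x, plus one
# outlier placed across the main direction of the inliers.
t = np.linspace(-2.0, 2.0, 20)
inliers = np.column_stack([t, t + 0.3 * np.sin(5.0 * t)])
outlier = np.array([[2.0, -2.0]])
data = np.vstack([inliers, outlier])  # the outlier is the last row

# Model fitted to the inliers only (the "green" model in the caption).
clean_mean = inliers.mean(axis=0)
clean_cov = np.cov(inliers, rowvar=False)

# Model fitted to all points, outlier included (the "red" model).
contaminated_mean = data.mean(axis=0)
contaminated_cov = np.cov(data, rowvar=False)

d_clean = mahalanobis(data, clean_mean, clean_cov)
d_contaminated = mahalanobis(data, contaminated_mean, contaminated_cov)

print("outlier distance, model fitted to inliers only:", d_clean[-1])
print("outlier distance, model fitted to all points:  ", d_contaminated[-1])
print("largest inlier distance, inliers-only model:   ", d_clean[:-1].max())
print("largest inlier distance, all-points model:     ", d_contaminated[:-1].max())
```

Under the inliers-only model the outlier receives a much larger distance than under the model that was fitted with the outlier included, which is the masking effect. Whether an inlier is additionally swamped depends on the data and on the threshold applied to the distances, so the sketch only prints the inlier distances for inspection.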
DOI: 10.1002/widm.1280