Segmentation of Multivariate Mixed Data via Lossy Data Coding and Compression

In this paper, based on ideas from lossy data coding and compression, we present a simple but effective technique for segmenting multivariate mixed data that are drawn from a mixture of Gaussian distributions, which are allowed to be almost degenerate. The goal is to find the optimal segmentation th...

Celý popis

Uložené v:
Podrobná bibliografia
Vydané v:IEEE transactions on pattern analysis and machine intelligence Ročník 29; číslo 9; s. 1546 - 1562
Hlavní autori: Yi Ma, Derksen, H., Wei Hong, Wright, J.
Médium: Journal Article
Jazyk:English
Vydavateľské údaje: Los Alamitos, CA IEEE 01.09.2007
IEEE Computer Society
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Predmet:
ISSN:0162-8828, 1939-3539
On-line prístup:Získať plný text
Tagy: Pridať tag
Žiadne tagy, Buďte prvý, kto otaguje tento záznam!
Popis
Shrnutí:In this paper, based on ideas from lossy data coding and compression, we present a simple but effective technique for segmenting multivariate mixed data that are drawn from a mixture of Gaussian distributions, which are allowed to be almost degenerate. The goal is to find the optimal segmentation that minimizes the overall coding length of the segmented data, subject to a given distortion. By analyzing the coding length/rate of mixed data, we formally establish some strong connections of data segmentation to many fundamental concepts in lossy data compression and rate-distortion theory. We show that a deterministic segmentation is approximately the (asymptotically) optimal solution for compressing mixed data. We propose a very simple and effective algorithm that depends on a single parameter, the allowable distortion. At any given distortion, the algorithm automatically determines the corresponding number and dimension of the groups and does not involve any parameter estimation. Simulation results reveal intriguing phase-transition-like behaviors of the number of segments when changing the level of distortion or the amount of outliers. Finally, we demonstrate how this technique can be readily applied to segment real imagery and bioinformatic data.
Bibliografia:ObjectType-Article-2
SourceType-Scholarly Journals-1
ObjectType-Feature-1
content type line 14
content type line 23
ObjectType-Article-1
ObjectType-Feature-2
ISSN:0162-8828
1939-3539
DOI:10.1109/TPAMI.2007.1085