Regularized projection pursuit for data with a small sample-to-variable ratio

Bibliographic Details
Published in: Metabolomics Vol. 10; no. 4; pp. 589-606
Main Authors: Hou, Siyuan; Wentzell, Peter D.
Format: Journal Article
Language: English
Published: New York Springer US 01.08.2014
Springer Nature B.V.
ISSN: 1573-3882, 1573-3890
Description
Summary: As a systematic and holistic study of metabolites in plants, animals, and human beings, metabolomics has advanced considerably in recent years, due largely to the rapid development of analytical technology and the application of multivariate data analysis methods. Exploratory data analysis, which has played a crucial role in this advance, aims to examine the natural data structure to reveal important information. Principal components analysis (PCA) is probably the most widely used technique for exploratory data analysis, but projection pursuit (PP) is another important method that often outperforms PCA because it is based on distributional rather than variance optimization. Recent algorithmic improvements have made the implementation of PP easier, but when the sample size is small compared to the number of variables, PP (with kurtosis as a projection index) is found to fail to give meaningful information. Mathematically, this involves the ill-posed inverse problem that also occurs for many other multivariate data analysis methods and results in overfitting. In this work, a regularized projection pursuit (RPP) method is proposed to solve this problem, and iterative optimization algorithms are developed for both step-wise univariate and multivariate PP. The utility of the algorithms is established using simulated data, which also demonstrate the use of ridge trace plots for the optimization of the ridge parameter. Three experimental data sets in the public domain are also analyzed, including a study on soybean disease (47 samples × 35 variables), NMR spectral data for glomerulonephritis patients (50 × 200), and metabolomics data from a bovine diet study (39 × 47). In all cases, RPP showed superior class separation compared to PCA or ordinary PP.
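The two ideas named in the summary, kurtosis as a projection index and the ill-posedness that arises when samples are fewer than variables, can be illustrated with a minimal sketch. This is not the authors' RPP algorithm. The function names `kurtosis_index` and `ridge_covariance`, the ridge value 0.1, and the random 50 × 200 matrix (chosen only to match the shape of the NMR data set mentioned above) are all assumptions made for illustration.

```python
import numpy as np

def kurtosis_index(X, v):
    """Kurtosis of the mean-centered data projected onto direction v.

    Low values suggest a bimodal (clustered) projection; this is a generic
    sample kurtosis, assumed here for illustration.
    """
    Xc = X - X.mean(axis=0)
    z = Xc @ v
    return len(z) * np.sum(z**4) / np.sum(z**2) ** 2

def ridge_covariance(X, lam):
    """Sample covariance plus lam * I (hypothetical helper, a generic ridge term)."""
    Xc = X - X.mean(axis=0)
    n, p = Xc.shape
    return Xc.T @ Xc / (n - 1) + lam * np.eye(p)

# Assumed toy data: 50 samples, 200 variables, pure Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))

# With n < p the centered covariance has rank at most n - 1 in exact
# arithmetic, so inverting it is ill-posed.
rank_plain = np.linalg.matrix_rank(ridge_covariance(X, 0.0))

# Adding a ridge term shifts every eigenvalue up by lam, giving full rank.
rank_ridge = np.linalg.matrix_rank(ridge_covariance(X, 0.1))

print(rank_plain, rank_ridge)
```

The ridge value here is arbitrary. The summary indicates that the paper selects the ridge parameter with ridge trace plots, which this sketch does not attempt.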
DOI:10.1007/s11306-013-0612-z