Empirical characterization of random forest variable importance measures

Bibliographic Details
Published in:	Computational statistics & data analysis Vol. 52; no. 4; pp. 2249 - 2260
Main Authors:	Archer, Kellie J., Kimes, Ryan V.
Format:	Journal Article
Language:	English
Published:	Amsterdam: Elsevier B.V., 01.01.2008
Series:	Computational Statistics & Data Analysis
Subjects:	Bootstrap aggregating; Classification tree; Exact sciences and technology; General topics; Mathematics; Multivariate analysis; Numerical analysis; Numerical analysis. Scientific computation; Numerical methods in probability and statistics; Probability and statistics; Probability theory and stochastic processes; Random forest; Sciences and techniques of general use; Statistics; Stochastic processes; Variable importance; Correlation; Forests; Random measure; Covariate; Stochastic process; Statistical simulation; Microarray; Learning; Characterization; Law of large numbers; Random variable; Discriminant analysis; Data analysis; Prior distribution; Statistical association; Statistical estimation; Neural network; Statistical computation; Correlation analysis; Small sample
ISSN:	0167-9473, 1872-7352

Description
Summary:	Microarray studies yield data sets consisting of a large number of candidate predictors (genes) on a small number of observations (samples). When interest lies in predicting phenotypic class using gene expression data, the goals are often both to produce an accurate classifier and to uncover the predictive structure of the problem. Most machine learning methods, such as k-nearest neighbors, support vector machines, and neural networks, are useful for classification. However, these methods provide no insight regarding the covariates that best contribute to the predictive structure. Other methods, such as linear discriminant analysis, require that the predictor space be substantially reduced prior to deriving the classifier. A recently developed method, random forests (RF), does not require reduction of the predictor space prior to classification. Additionally, RF yield variable importance measures for each candidate predictor. This study examined the effectiveness of RF variable importance measures in identifying the true predictor among a large number of candidate predictors. An extensive simulation study was conducted using 20 levels of correlation among the predictor variables and 7 levels of association between the true predictor and the dichotomous response. We conclude that the RF methodology is attractive for use in classification problems when the goals of the study are to produce an accurate classifier and to provide insight regarding the discriminative ability of individual predictor variables. Such goals are common among microarray studies, and therefore application of the RF methodology for the purpose of obtaining variable importance measures is demonstrated on a microarray data set.
DOI:	10.1016/j.csda.2007.08.015
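
Illustration:	The kind of experiment the summary describes can be sketched in a few lines of standard-library Python. This is a toy illustration only, not the authors' code or simulation design: it uses a single correlation level and a single association level rather than the 20 and 7 levels of the study, a logistic link between the true predictor and the response, a forest of one-split trees (stumps) instead of fully grown trees, and out-of-bag permutation importance. All names and parameter values (simulate, fit_stump, rho, beta, mtry, and so on) are assumptions introduced for the example.

```python
import math
import random


def simulate(n, p, rho, beta, rng):
    """Draw n samples of p equicorrelated predictors; only column 0 drives y."""
    X, y = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)  # shared factor giving pairwise correlation rho
        row = [math.sqrt(rho) * z + math.sqrt(1 - rho) * rng.gauss(0, 1)
               for _ in range(p)]
        prob = 1 / (1 + math.exp(-beta * row[0]))  # logistic link (assumption)
        X.append(row)
        y.append(1 if rng.random() < prob else 0)
    return X, y


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    q = sum(labels) / len(labels)
    return 2 * q * (1 - q)


def fit_stump(X, y, rows, features):
    """Best single split over the candidate features, chosen by weighted Gini."""
    best = None
    for j in features:
        values = sorted({X[i][j] for i in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y[i] for i in rows if X[i][j] <= t]
            right = [y[i] for i in rows if X[i][j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                # Each side predicts its majority class.
                best = (score, j, t,
                        int(2 * sum(left) >= len(left)),
                        int(2 * sum(right) >= len(right)))
    return best[1:] if best else None


def oob_permutation_importance(X, y, n_trees, mtry, rng):
    """Mean drop in out-of-bag accuracy when a predictor's values are permuted."""
    n, p = len(X), len(X[0])
    importance = [0.0] * p
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]
        stump = fit_stump(X, y, boot, rng.sample(range(p), mtry))
        if stump is None or not oob:
            continue
        j, t, left_label, right_label = stump

        def accuracy(column):
            hits = sum((left_label if v <= t else right_label) == y[i]
                       for v, i in zip(column, oob))
            return hits / len(oob)

        column = [X[i][j] for i in oob]
        shuffled = column[:]
        rng.shuffle(shuffled)
        # A stump uses one predictor, so only that predictor can lose accuracy.
        importance[j] += (accuracy(column) - accuracy(shuffled)) / n_trees
    return importance


rng = random.Random(0)
X, y = simulate(n=100, p=20, rho=0.3, beta=3.0, rng=rng)
importance = oob_permutation_importance(X, y, n_trees=100, mtry=4, rng=rng)
top = max(range(len(importance)), key=importance.__getitem__)
print("top-ranked predictor:", top)
```

Varying rho and beta over a grid and recording how often predictor 0 is ranked first would give a rough analogue of the study's design; a library implementation of random forests would be the natural choice for anything beyond illustration.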