Improving random forest predictions in small datasets from two-phase sampling designs.

Bibliographic Details
Title: Improving random forest predictions in small datasets from two-phase sampling designs.
Authors: Han S; Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, USA., Williamson BD; Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, USA., Fong Y; Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, USA. youyifong@gmail.com.
Source: BMC medical informatics and decision making [BMC Med Inform Decis Mak] 2021 Nov 22; Vol. 21 (1), pp. 322. Date of Electronic Publication: 2021 Nov 22.
Publication Type: Clinical Trial, Phase III; Journal Article; Research Support, N.I.H., Extramural
Language: English
Journal Info: Publisher: BioMed Central Country of Publication: England NLM ID: 101088682 Publication Model: Electronic Cited Medium: Internet ISSN: 1472-6947 (Electronic) Linking ISSN: 14726947 NLM ISO Abbreviation: BMC Med Inform Decis Mak Subsets: MEDLINE
Imprint Name(s): Original Publication: London : BioMed Central, [2001-
MeSH Terms: Machine Learning*; Vaccine Efficacy*; Humans; Probability
Abstract: Background: While random forests are one of the most successful machine learning methods, it is necessary to optimize their performance for use with datasets resulting from a two-phase sampling design with a small number of cases, a common situation in biomedical studies, which often have rare outcomes and covariates whose measurement is resource-intensive.
Methods: Using an immunologic marker dataset from a phase III HIV vaccine efficacy trial, we seek to optimize random forest prediction performance using combinations of variable screening, class balancing, weighting, and hyperparameter tuning.
Results: Our experiments show that class balancing improves random forest prediction performance when variable screening is not applied, but has a negative impact on performance when variable screening is applied. The impact of weighting similarly depends on whether variable screening is applied. Hyperparameter tuning is ineffective in situations with small sample sizes. We further show that random forests under-perform generalized linear models for some subsets of markers. Prediction performance on this dataset can be improved by stacking random forests and generalized linear models trained on different subsets of predictors, and the extent of improvement depends critically on the dissimilarities between candidate learner predictions.
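The stacking result can be illustrated with a minimal sketch. It assumes two candidate learners (labeled here as a random forest and a generalized linear model) whose cross-validated predicted probabilities are already available, and it combines them with a single convex weight chosen by grid search on log loss. The function names, the convex-combination form, the loss, and the toy numbers are all illustrative assumptions; the abstract does not state how the authors' stacking was implemented.

```python
import math

def log_loss(y, p, eps=1e-12):
    """Mean negative log-likelihood of binary labels y under probabilities p."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1 - eps)  # keep log() finite
        total += -(yi * math.log(pi) + (1 - yi) * math.log(1 - pi))
    return total / len(y)

def stack_two(y, p_rf, p_glm, grid_size=101):
    """Choose alpha in [0, 1] minimizing the log loss of
    alpha * p_rf + (1 - alpha) * p_glm (hypothetical helper)."""
    best_alpha, best_loss = 0.0, float("inf")
    for k in range(grid_size):
        alpha = k / (grid_size - 1)
        blend = [alpha * a + (1 - alpha) * b for a, b in zip(p_rf, p_glm)]
        loss = log_loss(y, blend)
        if loss < best_loss:
            best_alpha, best_loss = alpha, loss
    return best_alpha, best_loss

def mean_abs_difference(p_a, p_b):
    """Simple dissimilarity between two learners' predictions."""
    return sum(abs(a - b) for a, b in zip(p_a, p_b)) / len(p_a)

# Toy, made-up cross-validated predictions (not data from the trial).
y     = [1, 0, 0, 1, 0, 0, 1, 0]
p_rf  = [0.80, 0.30, 0.10, 0.40, 0.20, 0.60, 0.70, 0.10]
p_glm = [0.60, 0.10, 0.40, 0.90, 0.30, 0.20, 0.50, 0.30]

alpha, stacked_loss = stack_two(y, p_rf, p_glm)
dissimilarity = mean_abs_difference(p_rf, p_glm)
```

Because the grid includes alpha = 0 and alpha = 1, the stacked loss on these predictions can never exceed the better single learner's loss. If the two learners give identical predictions, the dissimilarity is zero and the blend cannot improve on either one, which is one way to read the abstract's remark about dissimilarity between candidate learner predictions.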
Conclusion: In small datasets from two-phase sampling designs, variable screening and inverse sampling probability weighting are important for achieving good prediction performance of random forests. In addition, stacking random forests and simple linear models can offer improvements over random forests.
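As a rough sketch of inverse sampling probability weighting, assume a two-phase design in which outcomes are known for everyone in phase one, while markers are measured in phase two for all cases and a random subset of controls. Each phase-two subject is then weighted by the inverse of its stratum's estimated sampling probability. The stratum labels, counts, and helper names below are hypothetical, and passing the weights to a forest through a weighted bootstrap is only one possible choice, not necessarily the authors' approach.

```python
import random
from collections import Counter

def inverse_sampling_weights(phase1_strata, phase2_strata):
    """Weight each phase-two subject by 1 / (estimated sampling probability
    of its stratum), i.e. phase-one count / phase-two count."""
    n1 = Counter(phase1_strata)
    n2 = Counter(phase2_strata)
    return [n1[s] / n2[s] for s in phase2_strata]

def weighted_bootstrap(n, weights, rng):
    """Draw n indices with probability proportional to the weights,
    for example to form the training sample of a single tree."""
    return rng.choices(range(n), weights=weights, k=n)

# Hypothetical design: 20 cases and 1000 controls in phase one;
# all cases and 100 controls have markers measured in phase two.
phase1 = ["case"] * 20 + ["control"] * 1000
phase2 = ["case"] * 20 + ["control"] * 100

weights = inverse_sampling_weights(phase1, phase2)  # cases 1.0, controls 10.0
rng = random.Random(0)                              # fixed seed for repeatability
sample_idx = weighted_bootstrap(len(phase2), weights, rng)
```

The weights sum to the phase-one sample size (1020 here), so weighted summaries of the phase-two data are on the scale of the full cohort rather than the case-enriched subsample.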
(© 2021. The Author(s).)
Grant Information: S10 OD028685 United States OD NIH HHS; S10OD028685 United States NH NIH HHS; UM1 AI068635 United States AI NIAID NIH HHS; R01 AI122991 United States AI NIAID NIH HHS
Contributed Indexing: Keywords: Case–control design; Class imbalance; HIV vaccine; Variable screening
Entry Date(s): Date Created: 20211123 Date Completed: 20220124 Latest Revision: 20240407
Update Code: 20250114
PubMed Central ID: PMC8607560
DOI: 10.1186/s12911-021-01688-3
PMID: 34809631
Database: MEDLINE