An Optimal Sample Data Usage Strategy to Minimize Overfitting and Underfitting Effects in Regression Tree Models Based on Remotely-Sensed Data

Detailed bibliography
Title: An Optimal Sample Data Usage Strategy to Minimize Overfitting and Underfitting Effects in Regression Tree Models Based on Remotely-Sensed Data
Authors: Yingxin Gu, Bruce K. Wylie, Stephen P. Boyte, Joshua Picotte, Daniel M. Howard, Kelcy Smith, Kurtis J. Nelson
Source: Remote Sensing, Vol 8, Iss 11, p 943 (2016)
Publisher information: MDPI AG
Publication year: 2016
Collection: Directory of Open Access Journals: DOAJ Articles
Subjects: remote sensing, data mining, regression tree mapping model, Cubist optimization, Python scripts, overfitting, underfitting, MODIS NDVI, Landsat, Science
Description: Regression tree models have been widely used for remote sensing-based ecosystem mapping. Improper use of the sample data (model training and testing data) may cause overfitting and underfitting effects in the model. The goal of this study is to develop an optimal sampling data usage strategy for any dataset and identify an appropriate number of rules in the regression tree model that will improve its accuracy and robustness. Landsat 8 data and Moderate-Resolution Imaging Spectroradiometer-scaled Normalized Difference Vegetation Index (NDVI) were used to develop regression tree models. A Python procedure was designed to generate random replications of model parameter options across a range of model development data sizes and rule number constraints. The mean absolute difference (MAD) between the predicted and actual NDVI (scaled NDVI, value from 0–200) and its variability across the different randomized replications were calculated to assess the accuracy and stability of the models. In our case study, a six-rule regression tree model developed from 80% of the sample data had the lowest MAD (MADtraining = 2.5 and MADtesting = 2.4), which was suggested as the optimal model. This study demonstrates how the training data and rule number selections impact model accuracy and provides important guidance for future remote-sensing-based ecosystem modeling.
Document type: article in journal/newspaper
Language: English
Relation: http://www.mdpi.com/2072-4292/8/11/943; https://doaj.org/toc/2072-4292; https://doaj.org/article/3417c97c6d5f4fda83a262c5cefa681d
DOI: 10.3390/rs8110943
Availability: https://doi.org/10.3390/rs8110943
https://doaj.org/article/3417c97c6d5f4fda83a262c5cefa681d
Accession number: edsbas.99217F09
Database: BASE
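The procedure summarized in the abstract (random replications across training-data sizes and rule-number limits, scored by MAD and its variability) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' script: it does not call Cubist, but substitutes a simple piecewise-linear stand-in with one least-squares line per "rule"; the data are synthetic rather than Landsat 8 or MODIS NDVI; and all function names, training fractions, and rule counts are hypothetical.

```python
import math
import random
import statistics


def mean_abs_diff(predicted, actual):
    """Mean absolute difference (MAD) between predicted and actual values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def fit_line(points):
    """Least-squares line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return 0.0, mean_y  # degenerate bin: fall back to the mean
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_rules(train, n_rules):
    """Stand-in for a rule-based model (NOT Cubist): equal-count bins on x,
    one linear model per bin. Requires len(train) >= n_rules."""
    ordered = sorted(train)
    n = len(ordered)
    rules = []
    for i in range(n_rules):
        chunk = ordered[i * n // n_rules:(i + 1) * n // n_rules]
        # The last rule is open-ended so every x is covered.
        upper = chunk[-1][0] if i < n_rules - 1 else float("inf")
        rules.append((upper, fit_line(chunk)))
    return rules


def predict(rules, x):
    """Apply the first rule whose upper bound covers x."""
    for upper, (slope, intercept) in rules:
        if x <= upper:
            return slope * x + intercept


def evaluate(samples, train_fractions, rule_counts, n_reps=20, seed=0):
    """For each (training fraction, rule count), repeat random splits and
    report mean training MAD, mean testing MAD, and testing MAD std. dev.
    n_reps must be at least 2 for the standard deviation."""
    rng = random.Random(seed)
    results = {}
    for frac in train_fractions:
        for n_rules in rule_counts:
            train_mads, test_mads = [], []
            for _ in range(n_reps):
                shuffled = samples[:]
                rng.shuffle(shuffled)
                cut = round(frac * len(shuffled))
                train, test = shuffled[:cut], shuffled[cut:]
                rules = fit_rules(train, n_rules)
                for subset, out in ((train, train_mads), (test, test_mads)):
                    preds = [predict(rules, x) for x, _ in subset]
                    out.append(mean_abs_diff(preds, [y for _, y in subset]))
            results[(frac, n_rules)] = {
                "mad_train": statistics.mean(train_mads),
                "mad_test": statistics.mean(test_mads),
                "sd_test": statistics.stdev(test_mads),
            }
    return results


def make_synthetic_samples(n=400, seed=1):
    """Synthetic (x, y) pairs on a roughly 0-200 scale; not the study's data."""
    rng = random.Random(seed)
    return [(x, 100 + 60 * math.sin(6 * x) + rng.gauss(0, 3))
            for x in (rng.random() for _ in range(n))]


if __name__ == "__main__":
    table = evaluate(make_synthetic_samples(), [0.2, 0.5, 0.8], [1, 3, 6, 12])
    for (frac, n_rules), r in sorted(table.items()):
        print(f"train={frac:.0%} rules={n_rules:2d} "
              f"MAD_train={r['mad_train']:.2f} MAD_test={r['mad_test']:.2f} "
              f"sd_test={r['sd_test']:.2f}")
```

In a table like this, a training MAD well below the testing MAD would point toward overfitting, while high MAD on both sides would point toward underfitting. A small standard deviation across replications suggests the configuration is stable.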