Optimal Subdata Selection for Prediction Based on the Distribution of the Covariates

Bibliographic Details
Title: Optimal Subdata Selection for Prediction Based on the Distribution of the Covariates
Authors: Alvaro Cia-Mina, Jesus Lopez-Fidalgo, Weng Kee Wong
Contributors: Repositorio de Navarra
Source: Dadun. Depósito Académico Digital de la Universidad de Navarra
Universidad de Navarra
Publisher Information: Institute of Electrical and Electronics Engineers (IEEE), 2025.
Publication Year: 2025
Subject Terms: Optimization, D Optimality, Reviews, Simple Linear Regression Model, Truncated Normal, Proportion Of Data, Directional Derivative, Data Streams, Predictive Models, Equivalence Theorem, Training, Simple Random Sampling, Fisher Information, Exact Design, Quadratic Model, D Optimal Design, Covariance Matrices, Density Functional Theory, Labeling Cost, Design Criteria, Optimization Criteria, Bigger Sample, Data Models, X Random, Prediction Error, Information Matrix, I Optimality, Streaming Data, Subsampling Method, Acceptance Region, Vectors, Probability Distribution, Mean Squared Prediction Error, Thermal Comfort, Optimization Problem, Probability Density Function, Distribution Of Covariates, Optimal Selection, Linear Model, Type Of Design, Subsample Size, Marginal Density, Sequential Selection, Subsampling, Optimal Approximate Designs, Joint Distribution
Description: Huge data sets are now widely available, and there is growing interest in selecting an optimal subsample from the full data set to improve inference efficiency and reduce labeling costs. We propose a new criterion, called J-optimality, which builds upon a popular optimal selection criterion that minimizes the Random-X prediction error by additionally incorporating the joint distribution of the covariates. A key advantage of our approach is that we can relate the subsampling selection problem to that of finding an optimal approximate design under a convex criterion, for which analytical tools are already available. Consequently, the J-optimal subsampling method comes with theoretical results and theory-based algorithms for finding such subsamples. Simulation results and real data analysis show that our proposed methods outperform current subsampling methods, and that the proposed algorithms can also adapt efficiently to select an optimal subsample from streaming data.
Document Type: Article
File Description: application/pdf
ISSN: 2372-2096
DOI: 10.1109/tbdata.2025.3552343
Access URL: https://hdl.handle.net/10171/115920
Rights: CC BY
Accession Number: edsair.doi.dedup.....de0e7619ce749f3806e494e0741592cc
Database: OpenAIRE