Optimal Subdata Selection for Prediction Based on the Distribution of the Covariates
| Title: | Optimal Subdata Selection for Prediction Based on the Distribution of the Covariates |
|---|---|
| Authors: | Alvaro Cia-Mina, Jesus Lopez-Fidalgo, Weng Kee Wong |
| Contributors: | Repositorio de Navarra |
| Source: | Dadun. Depósito Académico Digital de la Universidad de Navarra Universidad de Navarra |
| Publisher Information: | Institute of Electrical and Electronics Engineers (IEEE), 2025. |
| Publication Year: | 2025 |
| Subject Terms: | Optimization, D Optimality, Reviews, Simple Linear Regression Model, Truncated Normal, Proportion Of Data, Directional Derivative, Data Streams, Predictive Models, Equivalence Theorem, Training, Simple Random Sampling, Fisher Information, Exact Design, Quadratic Model, D Optimal Design, Covariance Matrices, Density Functional Theory, Labeling Cost, Design Criteria, Optimization Criteria, Bigger Sample, Data Models, X Random, Prediction Error, Information Matrix, I Optimality, Streaming Data, Subsampling Method, Acceptance Region, Vectors, Probability Distribution, Mean Squared Prediction Error, Thermal Comfort, Optimization Problem, Probability Density Function, Distribution Of Covariates, Optimal Selection, Linear Model, Type Of Design, Subsample Size, Marginal Density, Sequential Selection, Subsampling, Optimal Approximate Designs, Joint Distribution |
| Description: | Huge data sets are now widely available, and there is growing interest in selecting an optimal subsample from the full data set to improve inference efficiency and reduce labeling costs. We propose a new criterion, called J-optimality, which builds upon a popular optimal selection criterion that minimizes the Random-X prediction error by additionally incorporating the joint distribution of the covariates. A key advantage of our approach is that we can relate the subsample selection problem to that of finding an optimal approximate design under a convex criterion, for which analytical tools are already available. Consequently, the J-optimal subsampling method comes with theoretical results and theory-based algorithms for finding such subsamples. Simulation results and real data analysis show that our proposed methods outperform current subsampling methods, and that the proposed algorithms can also adapt efficiently to select an optimal subsample from streaming data. |
| Document Type: | Article |
| File Description: | application/pdf |
| ISSN: | 2372-2096 |
| DOI: | 10.1109/tbdata.2025.3552343 |
| Access URL: | https://hdl.handle.net/10171/115920 |
| Rights: | CC BY |
| Accession Number: | edsair.doi.dedup.....de0e7619ce749f3806e494e0741592cc |
| Database: | OpenAIRE |