A survey on dataset quality in machine learning

Bibliographic Details
Published in:Information and Software Technology, Vol. 162, p. 107268
Main Authors: Gong, Youdi; Liu, Guangzhen; Xue, Yunzhi; Li, Rui; Meng, Lingzhong
Format: Journal Article
Language:English
Published: Elsevier B.V., 01.10.2023
ISSN:0950-5849, 1873-6025
Description
Summary:With the rise of big data, the quality of datasets has become a crucial factor affecting the performance of machine learning models. High-quality datasets are essential for realizing the value of data. This survey article summarizes research on dataset quality in machine learning, including the definition of related concepts, an analysis of quality issues and risks, and a review of dataset quality dimensions and metrics, analyzed from a dataset lifecycle perspective and summarized from the literature. Furthermore, the article introduces a comprehensive quality evaluation process, which includes a framework for dataset quality evaluation with dimensions and metrics, computation methods for quality metrics, and assessment models. These studies provide guidance for evaluating dataset quality in machine learning, which can help improve the accuracy, efficiency, and generalization ability of machine learning models and promote the development and application of artificial intelligence technology.
DOI:10.1016/j.infsof.2023.107268