A survey on dataset quality in machine learning
| Published in: | Information and software technology, Volume 162, p. 107268 |
|---|---|
| Main authors: | Gong, Youdi; Liu, Guangzhen; Xue, Yunzhi; Li, Rui; Meng, Lingzhong |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier B.V, 1 October 2023 |
| Subjects: | |
| ISSN: | 0950-5849, 1873-6025 |
| Abstract | With the rise of big data, the quality of datasets has become a crucial factor affecting the performance of machine learning models. High-quality datasets are essential for realizing the value of data. This survey article summarizes research on dataset quality in machine learning, including definitions of related concepts, an analysis of quality issues and risks, and a review of dataset quality dimensions and metrics drawn from the literature and organized from a dataset lifecycle perspective. Furthermore, this article introduces a comprehensive quality evaluation process, which includes a framework for dataset quality evaluation with dimensions and metrics, computation methods for quality metrics, and assessment models. These studies provide guidance for evaluating dataset quality in machine learning, which can help improve the accuracy, efficiency, and generalization ability of machine learning models and promote the development and application of artificial intelligence technology. |
|---|---|
| ArticleNumber | 107268 |
| Author | Meng, Lingzhong; Gong, Youdi; Xue, Yunzhi; Liu, Guangzhen; Li, Rui |
| Author_xml | – sequence: 1 givenname: Youdi surname: Gong fullname: Gong, Youdi organization: Institute of Software Chinese Academy of Sciences, Beijing, 100190, China – sequence: 2 givenname: Guangzhen surname: Liu fullname: Liu, Guangzhen organization: Institute of Software Chinese Academy of Sciences, Beijing, 100190, China – sequence: 3 givenname: Yunzhi surname: Xue fullname: Xue, Yunzhi organization: Institute of Software Chinese Academy of Sciences, Beijing, 100190, China – sequence: 4 givenname: Rui surname: Li fullname: Li, Rui organization: Institute of Software Chinese Academy of Sciences, Beijing, 100190, China – sequence: 5 givenname: Lingzhong surname: Meng fullname: Meng, Lingzhong email: lingzhong@iscas.ac.cn organization: Institute of Software Chinese Academy of Sciences, Beijing, 100190, China |
| Cites_doi | 10.1109/ASRU.2015.7404808 10.1007/978-3-319-11955-7_72 10.1016/j.dss.2018.03.011 10.1145/3190578 10.1007/978-3-319-10602-1_48 10.1186/s40537-021-00468-0 10.1109/BigDataCongress.2018.00029 10.21437/Interspeech.2016-805 10.1109/CVPR.2009.5206848 10.1007/s10676-021-09608-9 10.1016/j.patter.2021.100241 10.1145/3592616 10.1109/TPAMI.2017.2723009 10.1186/s40537-021-00439-5 10.1145/3592786 10.1007/3-540-45153-6_7 10.1109/TSE.2015.2479217 10.1109/ICBDCI.2019.8686099 10.18653/v1/D19-1018 10.1145/1060745.1060764 10.1007/s11263-009-0275-4 10.1016/j.future.2018.07.014 10.1109/INNOVATIONS.2018.8605945 |
| ContentType | Journal Article |
| Copyright | 2023 The Authors |
| Copyright_xml | – notice: 2023 The Authors |
| DBID | 6I. AAFTH AAYXX CITATION |
| DOI | 10.1016/j.infsof.2023.107268 |
| DatabaseName | ScienceDirect Open Access Titles Elsevier:ScienceDirect:Open Access CrossRef |
| DatabaseTitle | CrossRef |
| DatabaseTitleList | |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Business |
| EISSN | 1873-6025 |
| ExternalDocumentID | 10_1016_j_infsof_2023_107268 S0950584923001222 |
| ISICitedReferencesCount | 131 |
| ISSN | 0950-5849 |
| IngestDate | Sat Nov 29 07:07:48 EST 2025 Tue Nov 18 21:19:01 EST 2025 Fri Feb 23 02:36:40 EST 2024 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Keywords | Dataset quality; Machine Learning; Dataset |
| Language | English |
| License | This is an open access article under the CC BY-NC-ND license. |
| OpenAccessLink | https://dx.doi.org/10.1016/j.infsof.2023.107268 |
| PublicationCentury | 2000 |
| PublicationDate | October 2023 2023-10-00 |
| PublicationDateYYYYMMDD | 2023-10-01 |
| PublicationDate_xml | – month: 10 year: 2023 text: October 2023 |
| PublicationDecade | 2020 |
| PublicationTitle | Information and software technology |
| PublicationYear | 2023 |
| Publisher | Elsevier B.V |
| Publisher_xml | – name: Elsevier B.V |
– year: 2021 ident: 10.1016/j.infsof.2023.107268_b31 – year: 2018 ident: 10.1016/j.infsof.2023.107268_b37 article-title: Construction of big data quality measurement model – year: 2017 ident: 10.1016/j.infsof.2023.107268_b40 article-title: Datasets from fifteen years of automated requirements traceability research: Current state, characteristics, and quality – volume: 9 issue: 1 year: 2019 ident: 10.1016/j.infsof.2023.107268_b32 article-title: Automated cleaning of identity label noise in a large face dataset with quality control publication-title: IET Biometrics – year: 2009 ident: 10.1016/j.infsof.2023.107268_b13 article-title: Learning multiple layers of features from tiny images – ident: 10.1016/j.infsof.2023.107268_b4 doi: 10.18653/v1/D19-1018 – ident: 10.1016/j.infsof.2023.107268_b11 – start-page: 144 issue: 2 year: 2004 ident: 10.1016/j.infsof.2023.107268_b69 article-title: Fuzzy comprehensive evaluation model based on improved analytic hierarchy process publication-title: J. Hydraul. Eng. – volume: 38 start-page: 170 issue: 02 year: 2021 ident: 10.1016/j.infsof.2023.107268_b48 article-title: A new method for measuring the distribution consistency of mixed-attribute datasets publication-title: J. Shenzhen Univ. (Sci. Technol. Ed.) – ident: 10.1016/j.infsof.2023.107268_b3 doi: 10.1145/1060745.1060764 – volume: 24 start-page: 7232 issue: 10 year: 2018 ident: 10.1016/j.infsof.2023.107268_b29 article-title: Evaluating the quality of datasets in software engineering publication-title: J. Comput. Theor. Nanosci. – year: 2021 ident: 10.1016/j.infsof.2023.107268_b52 – volume: 88 start-page: 303 issue: 2 year: 2010 ident: 10.1016/j.infsof.2023.107268_b9 publication-title: Int. J. Comput. Vis. doi: 10.1007/s11263-009-0275-4 – year: 2014 ident: 10.1016/j.infsof.2023.107268_b39 article-title: Supporting traceability through affinity mining – ident: 10.1016/j.infsof.2023.107268_b21 – ident: 10.1016/j.infsof.2023.107268_b46 – volume: 89 start-page: 548 issue: DEC. 
year: 2018 ident: 10.1016/j.infsof.2023.107268_b64 article-title: Context-aware data quality assessment for big data publication-title: Future Gener. Comput. Syst. doi: 10.1016/j.future.2018.07.014 – year: 2015 ident: 10.1016/j.infsof.2023.107268_b17 article-title: Librispeech: An ASR corpus based on public domain audio books – ident: 10.1016/j.infsof.2023.107268_b1 doi: 10.1109/INNOVATIONS.2018.8605945 – year: 2019 ident: 10.1016/j.infsof.2023.107268_b68 – volume: 31 start-page: 302 issue: 2 year: 2020 ident: 10.1016/j.infsof.2023.107268_b49 article-title: Survey of data annotation publication-title: J. Softw. |
| StartPage | 107268 |
| SubjectTerms | Dataset; Dataset quality; Machine learning |
| Title | A survey on dataset quality in machine learning |
| URI | https://dx.doi.org/10.1016/j.infsof.2023.107268 |
| Volume | 162 |