How Can We Improve Data Quality for Machine Learning? A Visual Analytics System using Data and Process-driven Strategies
ML (Machine learning) models are used to mine inconspicuous information in big data. The model and data quality influence the performance of a machine-learning model. However, it is inefficient to modify the model, which is a black box, and low-quality data tends to cause biased learning of the mode...
| Published in: | IEEE Pacific Visualization Symposium, pp. 112-121 |
|---|---|
| Main authors: | Hong, Hyein; Yoo, Sangbong; Jin, Yejin; Jang, Yun |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 1 April 2023 |
| Subjects: | Human-centered computing; Visualization; Visualization systems and tools |
| ISSN: | 2165-8773 |
| Online access: | Get full text |
| Abstract | Machine learning (ML) models are used to mine inconspicuous information in big data. Both the model and the data quality influence the performance of an ML model. However, modifying the model, which is a black box, is inefficient, and low-quality data tends to bias the model's learning. Therefore, it is crucial to improve the data quality. Different techniques are used to improve data quality depending on the data conditions and the data quality issues, so improving data quality is time-consuming and challenging for users with insufficient knowledge of the data. Visual analytics techniques have been proposed that focus on decision support for improving data quality. However, existing approaches make it complicated for users to arrive at a comprehensive data quality improvement (DQI) method for generating data suitable for ML models, and they remain limited in that users must directly consider all combinations of DQI processes. This paper presents a novel visual analytics system that manages data quality for use in ML models. The proposed system suggests an optimal quality improvement process and supports DQI with visualization techniques such as heatmaps, histograms, and scatter plots. |
|---|---|
| Author | Jin, Yejin; Jang, Yun; Hong, Hyein; Yoo, Sangbong |
| Author_xml | 1. Hong, Hyein (gumdung98@naver.com), Sejong University, Seoul, South Korea; 2. Yoo, Sangbong (usangbong@gmail.com), Sejong University, Seoul, South Korea; 3. Jin, Yejin (yeajin357@gmail.com), Sejong University, Seoul, South Korea; 4. Jang, Yun (jangy@sejong.edu), Sejong University, Seoul, South Korea |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DBID | 6IE 6IL CBEJK RIE RIL |
| DOI | 10.1109/PacificVis56936.2023.00020 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume; IEEE Xplore All Conference Proceedings; IEEE Xplore; IEEE Proceedings Order Plans (POP All) 1998-Present |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Xplore url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Engineering |
| EISBN | 9798350321241 |
| EISSN | 2165-8773 |
| EndPage | 121 |
| ExternalDocumentID | 10148475 |
| Genre | orig-research |
| ISICitedReferencesCount | 1 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001016413500014&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| IngestDate | Wed Aug 27 02:56:22 EDT 2025 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| LinkModel | DirectLink |
| PageCount | 10 |
| ParticipantIDs | ieee_primary_10148475 |
| PublicationCentury | 2000 |
| PublicationDate | 2023-April |
| PublicationDateYYYYMMDD | 2023-04-01 |
| PublicationDate_xml | – month: 04 year: 2023 text: 2023-April |
| PublicationDecade | 2020 |
| PublicationTitle | IEEE Pacific Visualization Symposium |
| PublicationTitleAbbrev | PACIFICVIS |
| PublicationYear | 2023 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 112 |
| SubjectTerms | Human-centered computing; Visualization; Visualization systems and tools |
| Title | How Can We Improve Data Quality for Machine Learning? A Visual Analytics System using Data and Process-driven Strategies |
| URI | https://ieeexplore.ieee.org/document/10148475 |
| WOSCitedRecordID | wos001016413500014&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | 1 |
| inHoldings | 1 |