How Can We Improve Data Quality for Machine Learning? A Visual Analytics System using Data and Process-driven Strategies

Bibliographic Details
Published in: IEEE Pacific Visualization Symposium, pp. 112-121
Main Authors: Hong, Hyein, Yoo, Sangbong, Jin, Yejin, Jang, Yun
Format: Conference Proceeding
Language: English
Published: IEEE, 1 April 2023
ISSN: 2165-8773
Online Access: Full text
Abstract: Machine learning (ML) models are used to mine inconspicuous information in big data. The performance of an ML model depends on both the model and the quality of the data. However, modifying the model is inefficient because it is a black box, and low-quality data tends to bias what the model learns. It is therefore crucial to improve data quality. The appropriate technique differs with the condition of the data and the data quality issue at hand, so improving data quality is time-consuming and challenging for users who lack sufficient knowledge of the data. Visual analytics techniques have been proposed that focus on decision support for improving data quality. However, existing approaches make it difficult for users to arrive at a comprehensive data quality improvement (DQI) method for generating data suitable for ML models. They are also limited in that users must consider all combinations of DQI processes themselves. This paper presents a novel visual analytics system that manages data quality for use in ML models. The proposed system suggests an optimal quality improvement process and supports DQI with visualization techniques such as heatmaps, histograms, and scatter plots.
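The abstract notes that users otherwise have to consider every combination of DQI processes by hand. As a rough illustration of what that search involves, the following Python sketch enumerates ordered combinations of three toy cleaning steps and ranks them with a toy quality score. It is not the authors' system or scoring method: the step names, the score, and the sample column are all hypothetical assumptions made for this example.

```python
from itertools import permutations
from statistics import median


def mad_bounds(values, k=3.0):
    """Return (low, high) = median -/+ k * MAD over the non-missing values."""
    observed = [v for v in values if v is not None]
    med = median(observed)
    mad = median(abs(v - med) for v in observed)
    return med - k * mad, med + k * mad


def drop_missing(values):
    """Hypothetical DQI step: remove rows whose value is missing (None)."""
    return [v for v in values if v is not None]


def impute_median(values):
    """Hypothetical DQI step: fill missing values with the observed median."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def clip_outliers(values):
    """Hypothetical DQI step: clip values to the MAD-based bounds."""
    low, high = mad_bounds(values)
    return [v if v is None else min(max(v, low), high) for v in values]


STEPS = {
    "drop_missing": drop_missing,
    "impute_median": impute_median,
    "clip_outliers": clip_outliers,
}


def run(pipeline, values):
    """Apply the named steps in order and return the cleaned column."""
    for name in pipeline:
        values = STEPS[name](values)
    return values


def quality_score(values, n_original, bounds):
    """Toy score in [0, 1]: completeness * row retention * in-range share."""
    if not values:
        return 0.0
    low, high = bounds
    observed = [v for v in values if v is not None]
    if not observed:
        return 0.0
    completeness = len(observed) / len(values)
    retention = len(values) / n_original
    in_range = sum(low <= v <= high for v in observed) / len(observed)
    return completeness * retention * in_range


def best_pipeline(values):
    """Score every ordered combination of steps and return (score, pipeline)."""
    bounds = mad_bounds(values)  # reference bounds taken from the raw column
    candidates = [
        p for r in range(1, len(STEPS) + 1) for p in permutations(STEPS, r)
    ]
    scored = [
        (quality_score(run(p, values), len(values), bounds), p)
        for p in candidates
    ]
    return max(scored, key=lambda item: item[0])


# Hypothetical column with two missing entries and one extreme value.
column = [1.0, 2.0, None, 3.0, 2.5, None, 100.0, 2.2]
score, pipeline = best_pipeline(column)
cleaned = run(pipeline, column)
print(pipeline, round(score, 3), cleaned)
```

Even with three steps there are fifteen ordered pipelines to compare, and the count grows quickly as steps are added, which is the burden the abstract says a system should take off the user.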
Author_xml – sequence: 1
  givenname: Hyein
  surname: Hong
  fullname: Hong, Hyein
  email: gumdung98@naver.com
  organization: Sejong University,Seoul,South Korea
– sequence: 2
  givenname: Sangbong
  surname: Yoo
  fullname: Yoo, Sangbong
  email: usangbong@gmail.com
  organization: Sejong University,Seoul,South Korea
– sequence: 3
  givenname: Yejin
  surname: Jin
  fullname: Jin, Yejin
  email: yeajin357@gmail.com
  organization: Sejong University,Seoul,South Korea
– sequence: 4
  givenname: Yun
  surname: Jang
  fullname: Jang, Yun
  email: jangy@sejong.edu
  organization: Sejong University,Seoul,South Korea
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/PacificVis56936.2023.00020
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
EISBN 9798350321241
EISSN 2165-8773
EndPage 121
ExternalDocumentID 10148475
Genre orig-research
ISICitedReferencesCount 1
IngestDate Wed Aug 27 02:56:22 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
PageCount 10
ParticipantIDs ieee_primary_10148475
PublicationCentury 2000
PublicationDate 2023-April
PublicationDateYYYYMMDD 2023-04-01
PublicationDate_xml – month: 04
  year: 2023
  text: 2023-April
PublicationDecade 2020
PublicationTitle IEEE Pacific Visualization Symposium
PublicationTitleAbbrev PACIFICVIS
PublicationYear 2023
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 112
SubjectTerms Human-centered computing
Visualization
Visualization systems and tools
Title How Can We Improve Data Quality for Machine Learning? A Visual Analytics System using Data and Process-driven Strategies
URI https://ieeexplore.ieee.org/document/10148475