Active Learning Genetic programming for record deduplication

Bibliographic details
Published in: IEEE Congress on Evolutionary Computation, pp. 1-8
Main authors: de Freitas, Junio; Pappa, Gisele L.; da Silva, Altigran S.; Gonçalves, Marcos A.; Moura, Edleno; Veloso, Adriano; Laender, Alberto H.F.; de Carvalho, Moises G.
Format: Conference proceeding
Language: English
Published: IEEE, 1 July 2010
ISBN:1424469090, 9781424469093
ISSN:1089-778X
Online access: Full text
Abstract The great majority of genetic programming (GP) algorithms that deal with the classification problem follow a supervised approach, i.e., they consider that all fitness cases available to evaluate their models are labeled. However, in certain application domains, a lot of human effort is required to label training data, and methods following a semi-supervised approach might be more appropriate. This is because they significantly reduce the time required for data labeling while maintaining acceptable accuracy rates. This paper presents the Active Learning GP (AGP), a semi-supervised GP, and instantiates it for the data deduplication problem. AGP uses an active learning approach in which a committee of multi-attribute functions votes for classifying record pairs as duplicates or not. When the committee majority voting is not enough to predict the class of the data pairs, a user is called to solve the conflict. The method was applied to three datasets and compared to two other deduplication methods. Results show that AGP guarantees the quality of the deduplication while reducing the number of labeled examples needed.
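The committee step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `committee_label`, the `min_agreement` threshold, and the hand-written committee members are all assumptions made for illustration. In the paper the committee consists of multi-attribute functions produced by GP, and the abstract does not state the exact rule for when a vote counts as "not enough".

```python
from typing import Callable, Dict, Sequence, Tuple

Record = Dict[str, str]
# A committee member maps a record pair to True ("duplicate") or False.
Member = Callable[[Record, Record], bool]


def committee_label(
    a: Record,
    b: Record,
    committee: Sequence[Member],
    ask_user: Member,
    min_agreement: float = 0.75,  # assumed threshold, not taken from the paper
) -> Tuple[bool, bool]:
    """Return (is_duplicate, user_was_asked) for one record pair."""
    if not committee:
        raise ValueError("committee must not be empty")
    yes = sum(1 for member in committee if member(a, b))
    no = len(committee) - yes
    # Accept the majority vote only when enough members agree.
    if max(yes, no) / len(committee) >= min_agreement:
        return yes > no, False
    # Otherwise the vote is inconclusive and the user resolves the conflict.
    return ask_user(a, b), True


def _same(field: str) -> Member:
    """Hypothetical member: case-insensitive equality on one attribute."""
    return lambda a, b: a[field].strip().lower() == b[field].strip().lower()


# Hand-written stand-ins for the evolved multi-attribute functions.
EXAMPLE_COMMITTEE = [
    _same("name"),
    _same("city"),
    lambda a, b: _same("name")(a, b) and _same("city")(a, b),
    lambda a, b: a["name"][:3].lower() == b["name"][:3].lower(),
]
```

With this sketch, only pairs on which the committee splits reach `ask_user`, which is how such a scheme would reduce the number of labels a person has to provide.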
Author da Silva, Altigran S.
Veloso, Adriano
de Freitas, Junio
Gonçalves, Marcos A.
Moura, Edleno
Pappa, Gisele L.
Laender, Alberto H.F.
de Carvalho, Moises G.
Author_xml – sequence: 1
  givenname: Junio
  surname: de Freitas
  fullname: de Freitas, Junio
  email: jusf.@dcc.ufam.edu.br
  organization: Computer Science Department, Federal University of Amazonas - Brazil
– sequence: 2
  givenname: Gisele L.
  surname: Pappa
  fullname: Pappa, Gisele L.
  email: glpappa@dcc.ufmg.br
  organization: Computer Science Department, Federal University of Minas Gerais - Brazil
– sequence: 3
  givenname: Altigran S.
  surname: da Silva
  fullname: da Silva, Altigran S.
  email: alti@dcc.ufam.edu.br
  organization: Computer Science Department, Federal University of Amazonas - Brazil
– sequence: 4
  givenname: Marcos A.
  surname: Gonçalves
  fullname: Gonçalves, Marcos A.
  organization: Computer Science Department, Federal University of Minas Gerais - Brazil
– sequence: 5
  givenname: Edleno
  surname: Moura
  fullname: Moura, Edleno
  email: edleno.@dcc.ufam.edu.br
  organization: Computer Science Department, Federal University of Amazonas - Brazil
– sequence: 6
  givenname: Adriano
  surname: Veloso
  fullname: Veloso, Adriano
  email: adrianov@dcc.ufmg.br
  organization: Computer Science Department, Federal University of Minas Gerais - Brazil
– sequence: 7
  givenname: Alberto H.F.
  surname: Laender
  fullname: Laender, Alberto H.F.
  email: laender@dcc.ufmg.br
  organization: Computer Science Department, Federal University of Minas Gerais - Brazil
– sequence: 8
  givenname: Moises G.
  surname: de Carvalho
  fullname: de Carvalho, Moises G.
  email: moises@dcc.ufmg.br
  organization: Computer Science Department, Federal University of Minas Gerais, Belo Horizonte, Brazil
ContentType Conference Proceeding
DOI 10.1109/CEC.2010.5586104
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
Discipline Engineering
Computer Science
EISBN 1424469104
9781424469109
1424469112
9781424469116
EndPage 8
ExternalDocumentID 5586104
Genre orig-research
ISBN 1424469090
9781424469093
ISSN 1089-778X
IsPeerReviewed true
IsScholarly true
Language English
PageCount 8
PublicationCentury 2000
PublicationDate 2010-July
PublicationDateYYYYMMDD 2010-07-01
PublicationDate_xml – month: 07
  year: 2010
  text: 2010-July
PublicationDecade 2010
PublicationTitle IEEE Congress on Evolutionary Computation
PublicationTitleAbbrev CEC
PublicationYear 2010
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 1
SubjectTerms Decision support systems
Genetic programming
Labeling
Learning
Training
Training data
Title Active Learning Genetic programming for record deduplication
URI https://ieeexplore.ieee.org/document/5586104