Active Learning Genetic programming for record deduplication
| Published in: | IEEE Congress on Evolutionary Computation, pp. 1-8 |
|---|---|
| Main authors: | Junio de Freitas, Gisele L. Pappa, Altigran S. da Silva, Marcos A. Gonçalves, Edleno Moura, Adriano Veloso, Alberto H.F. Laender, Moises G. de Carvalho |
| Format: | Conference proceeding |
| Language: | English |
| Published: | IEEE, 1 July 2010 |
| Subjects: | Decision support systems; Genetic programming; Labeling; Learning; Training; Training data |
| ISBN: | 1424469090, 9781424469093 |
| ISSN: | 1089-778X |
| Online access: | Full text |
| Abstract | The great majority of genetic programming (GP) algorithms that deal with the classification problem follow a supervised approach, i.e., they consider that all fitness cases available to evaluate their models are labeled. However, in certain application domains, a lot of human effort is required to label training data, and methods following a semi-supervised approach might be more appropriate. This is because they significantly reduce the time required for data labeling while maintaining acceptable accuracy rates. This paper presents the Active Learning GP (AGP), a semi-supervised GP, and instantiates it for the data deduplication problem. AGP uses an active learning approach in which a committee of multi-attribute functions votes for classifying record pairs as duplicates or not. When the committee majority voting is not enough to predict the class of the data pairs, a user is called to solve the conflict. The method was applied to three datasets and compared to two other deduplication methods. Results show that AGP guarantees the quality of the deduplication while reducing the number of labeled examples needed. |
|---|---|
| Author | de Freitas, Junio; Pappa, Gisele L.; da Silva, Altigran S.; Gonçalves, Marcos A.; Moura, Edleno; Veloso, Adriano; Laender, Alberto H.F.; de Carvalho, Moises G. |
| Author_xml | 1. de Freitas, Junio (jusf.@dcc.ufam.edu.br), Computer Science Department, Federal University of Amazonas, Brazil; 2. Pappa, Gisele L. (glpappa@dcc.ufmg.br), Computer Science Department, Federal University of Minas Gerais, Brazil; 3. da Silva, Altigran S. (alti@dcc.ufam.edu.br), Computer Science Department, Federal University of Amazonas, Brazil; 4. Gonçalves, Marcos A., Computer Science Department, Federal University of Minas Gerais, Brazil; 5. Moura, Edleno (edleno.@dcc.ufam.edu.br), Computer Science Department, Federal University of Amazonas, Brazil; 6. Veloso, Adriano (adrianov@dcc.ufmg.br), Computer Science Department, Federal University of Minas Gerais, Brazil; 7. Laender, Alberto H.F. (laender@dcc.ufmg.br), Computer Science Department, Federal University of Minas Gerais, Brazil; 8. de Carvalho, Moises G. (moises@dcc.ufmg.br), Comput. Sci. Dept., Fed. Univ. of Minas Gerais, Belo Horizonte, Brazil |
| ContentType | Conference Proceeding |
| DOI | 10.1109/CEC.2010.5586104 |
| Discipline | Engineering; Computer Science |
| EISBN | 1424469104, 9781424469109, 1424469112, 9781424469116 |
| EndPage | 8 |
| ExternalDocumentID | 5586104 |
| Genre | orig-research |
| ISBN | 1424469090, 9781424469093 |
| ISSN | 1089-778X |
| IsPeerReviewed | true |
| IsScholarly | true |
| Language | English |
| PageCount | 8 |
| PublicationCentury | 2000 |
| PublicationDate | 2010-July |
| PublicationDateYYYYMMDD | 2010-07-01 |
| PublicationDecade | 2010 |
| PublicationTitle | IEEE Congress on Evolutionary Computation |
| PublicationTitleAbbrev | CEC |
| PublicationYear | 2010 |
| Publisher | IEEE |
| StartPage | 1 |
| SubjectTerms | Decision support systems; Genetic programming; Labeling; Learning; Training; Training data |
| Title | Active Learning Genetic programming for record deduplication |
| URI | https://ieeexplore.ieee.org/document/5586104 |
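
The committee-voting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the committee is built or when a vote counts as decisive. The hand-written similarity checks (`same_name`, `same_city`, `same_zip`), the `min_agreement` threshold, and the `ask_user` callback are all hypothetical stand-ins. In AGP the committee members would be functions evolved by genetic programming.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, str]
Pair = Tuple[Record, Record]
Member = Callable[[Record, Record], bool]  # True means "duplicate"


def committee_vote(committee: List[Member], pair: Pair,
                   min_agreement: float = 0.75) -> Optional[bool]:
    """Return True/False when the committee is decisive, None on conflict.

    min_agreement is an assumed threshold, not a value from the paper.
    """
    a, b = pair
    yes_votes = sum(1 for member in committee if member(a, b))
    share = yes_votes / len(committee)
    if share >= min_agreement:
        return True
    if share <= 1.0 - min_agreement:
        return False
    return None  # committee is split: defer to the user


def classify(committee: List[Member], pairs: List[Pair],
             ask_user: Callable[[Pair], bool]) -> Tuple[List[bool], int]:
    """Label every pair, counting how many labels the user had to supply."""
    labels: List[bool] = []
    queries = 0
    for pair in pairs:
        decision = committee_vote(committee, pair)
        if decision is None:
            decision = ask_user(pair)  # active-learning query
            queries += 1
        labels.append(decision)
    return labels, queries


# Hypothetical committee members (stand-ins for GP-evolved functions).
def same_name(a: Record, b: Record) -> bool:
    return a["name"].lower() == b["name"].lower()


def same_city(a: Record, b: Record) -> bool:
    return a["city"] == b["city"]


def same_zip(a: Record, b: Record) -> bool:
    return a["zip"] == b["zip"]


committee = [same_name, same_city, same_zip]
ana = {"name": "Ana Silva", "city": "Manaus", "zip": "69000"}
pairs = [
    (ana, {"name": "ana silva", "city": "Manaus", "zip": "69000"}),
    (ana, {"name": "Joao Souza", "city": "Belo Horizonte", "zip": "30000"}),
    (ana, {"name": "Ana Silva", "city": "Manaus", "zip": "69050"}),
]
# A stand-in oracle that always answers "duplicate" when asked.
labels, queries = classify(committee, pairs, ask_user=lambda pair: True)
```

With these toy records, the first two pairs are settled by a unanimous committee and only the third, where the members split two to one, is sent to the user. The number of user queries is the labeling cost that the abstract says AGP aims to reduce.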

