A Genetic Programming Approach to Record Deduplication
Several systems that rely on consistent data to offer high-quality services, such as digital libraries and e-commerce brokers, may be affected by the existence of duplicates, quasi replicas, or near-duplicate entries in their repositories. Because of that, there have been significant investments fro...
Saved in:
| Published in: | IEEE Transactions on Knowledge and Data Engineering, Vol. 24, No. 3, pp. 399-412 |
|---|---|
| Main authors: | de Carvalho, M. G.; Laender, A. H. F.; Goncalves, M. A.; da Silva, A. S. |
| Format: | Journal Article |
| Language: | English |
| Published: | IEEE, 01.03.2012 |
| Subjects: | |
| ISSN: | 1041-4347 |
| Online access: | Get full text |
| Abstract | Several systems that rely on consistent data to offer high-quality services, such as digital libraries and e-commerce brokers, may be affected by the existence of duplicates, quasi replicas, or near-duplicate entries in their repositories. Because of that, private and government organizations have invested significantly in developing methods for removing replicas from their data repositories. Clean, replica-free repositories not only allow the retrieval of higher-quality information but also lead to more concise data and to potential savings in the computational time and resources needed to process this data. In this paper, we propose a genetic programming approach to record deduplication that combines several different pieces of evidence extracted from the data content to find a deduplication function that is able to identify whether two entries in a repository are replicas or not. As shown by our experiments, our approach outperforms an existing state-of-the-art method found in the literature. Moreover, the suggested functions are computationally less demanding since they use less evidence. In addition, our genetic programming approach is capable of automatically adapting these functions to a given fixed replica identification boundary, freeing the user from the burden of having to choose and tune this parameter. |
|---|---|
| Author | da Silva, A. S.; Laender, A. H. F.; Goncalves, M. A.; de Carvalho, M. G. |
| Author_xml | 1. de Carvalho, M. G. (ext-moises.carvalho@nokia.com), Nokia INdT, Manaus, Brazil; 2. Laender, A. H. F. (laender@dcc.ufmg.br), Dept. de Cienc. da Comput., Univ. Fed. de Minas Gerais, Belo Horizonte, Brazil; 3. Goncalves, M. A. (mgoncalv@dcc.ufmg.br), Dept. de Cienc. da Comput., Univ. Fed. de Minas Gerais, Belo Horizonte, Brazil; 4. da Silva, A. S. (alti@icomp.ufam.edu.br), Inst. de Comput., Univ. Fed. do Amazonas, Manaus, Brazil |
| CODEN | ITKEEH |
| ContentType | Journal Article |
| DOI | 10.1109/TKDE.2010.234 |
| Discipline | Engineering Computer Science |
| EndPage | 412 |
| Genre | orig-research |
| ISICitedReferencesCount | 45 |
| ISSN | 1041-4347 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 3 |
| Language | English |
| License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html |
| PageCount | 14 |
| PublicationCentury | 2000 |
| PublicationDate | 2012-03-01 |
| PublicationDateYYYYMMDD | 2012-03-01 |
| PublicationDate_xml | – month: 03 year: 2012 text: 2012-03-01 day: 01 |
| PublicationDecade | 2010 |
| PublicationTitle | IEEE transactions on knowledge and data engineering |
| PublicationTitleAbbrev | TKDE |
| PublicationYear | 2012 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| StartPage | 399 |
| SubjectTerms | Data mining; Database administration; database integration; evolutionary computing and genetic algorithms; Genetic programming; Machine learning; Probabilistic logic; Training |
| Title | A Genetic Programming Approach to Record Deduplication |
| URI | https://ieeexplore.ieee.org/document/5645623 |
| Volume | 24 |
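
The abstract describes a deduplication function that combines several pieces of evidence and compares the result against a fixed replica identification boundary. The Python sketch below illustrates only that general idea. It is not the authors' implementation: the tuple-based expression tree, the `difflib`-based `similarity` stand-in, the attribute names, the hand-written `CANDIDATE` tree, and the `BOUNDARY` value are all assumptions made for this example, and the evolutionary search that would actually produce such a tree is omitted.

```python
from difflib import SequenceMatcher
import operator

# Arithmetic operators allowed at internal tree nodes (illustrative choice).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def similarity(a, b):
    """Stand-in string similarity in [0, 1]; an assumption, not the paper's measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def evaluate(tree, rec_a, rec_b):
    """Evaluate an expression tree over the evidence for one pair of records."""
    if isinstance(tree, (int, float)):
        # Numeric constant leaf.
        return float(tree)
    if isinstance(tree, str):
        # Evidence leaf: similarity of the named attribute in both records.
        return similarity(rec_a[tree], rec_b[tree])
    # Internal node: (operator symbol, left subtree, right subtree).
    op, left, right = tree
    return OPS[op](evaluate(left, rec_a, rec_b), evaluate(right, rec_a, rec_b))


def is_replica(tree, rec_a, rec_b, boundary):
    """Flag the pair as replicas when the combined score reaches the boundary."""
    return evaluate(tree, rec_a, rec_b) >= boundary


# Hypothetical candidate: 2 * sim(title) + sim(author), written by hand here.
CANDIDATE = ("+", ("*", 2.0, "title"), "author")
# Hypothetical fixed replica identification boundary.
BOUNDARY = 2.0

rec_a = {"title": "A Genetic Programming Approach to Record Deduplication",
         "author": "de Carvalho, M. G."}
rec_b = {"title": "a genetic programming approach to record deduplication",
         "author": "DE CARVALHO, M. G."}

# The two records differ only in letter case, so every evidence value is 1.0.
print(is_replica(CANDIDATE, rec_a, rec_b, BOUNDARY))  # prints True
```

In a genetic programming setting, many such trees would be generated, scored against labelled record pairs, and recombined over generations; with the boundary held fixed, the search adapts the function rather than requiring the user to tune the threshold.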