A Genetic Programming Approach to Record Deduplication

Bibliographic details
Published in: IEEE Transactions on Knowledge and Data Engineering, Volume 24, Issue 3, pp. 399-412
Main authors: de Carvalho, M. G., Laender, A. H. F., Goncalves, M. A., da Silva, A. S.
Format: Journal Article
Language: English
Published: IEEE, 1 March 2012
ISSN: 1041-4347
Abstract Several systems that rely on consistent data to offer high-quality services, such as digital libraries and e-commerce brokers, may be affected by the existence of duplicates, quasi-replicas, or near-duplicate entries in their repositories. Because of that, private and government organizations have invested significantly in developing methods for removing replicas from their data repositories. Clean, replica-free repositories not only allow the retrieval of higher-quality information but also lead to more concise data and to potential savings in the computational time and resources needed to process this data. In this paper, we propose a genetic programming approach to record deduplication that combines several different pieces of evidence extracted from the data content to find a deduplication function that is able to identify whether two entries in a repository are replicas or not. As shown by our experiments, our approach outperforms an existing state-of-the-art method found in the literature. Moreover, the suggested functions are computationally less demanding since they use less evidence. In addition, our genetic programming approach is capable of automatically adapting these functions to a given fixed replica identification boundary, freeing the user from the burden of having to choose and tune this parameter.
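The abstract describes a deduplication function that combines pieces of evidence and is compared against a fixed replica identification boundary. The following is a minimal Python sketch of what evaluating one such candidate function might look like; it is not the authors' implementation. The tuple-based tree encoding, the `EVIDENCE` table, the use of `difflib` as the similarity measure, the sample records, and the boundary value of 1.0 are all assumptions made for illustration. A genetic programming system would evolve many such trees; this sketch only evaluates a single hand-written one.

```python
import operator
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Similarity in [0, 1] between two strings (difflib ratio; an illustrative choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical evidence table: each leaf name maps to an
# (attribute, similarity function) pair applied to a pair of records.
EVIDENCE = {
    "name_sim": ("name", string_similarity),
    "city_sim": ("city", string_similarity),
}

# Arithmetic operators allowed at the internal nodes of a tree.
OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(tree, rec_a, rec_b):
    """Recursively evaluate an expression tree on a pair of records."""
    if isinstance(tree, (int, float)):   # numeric constant leaf
        return float(tree)
    if isinstance(tree, str):            # evidence leaf
        attribute, similarity = EVIDENCE[tree]
        return similarity(rec_a[attribute], rec_b[attribute])
    op, left, right = tree               # internal node: (operator, left, right)
    return OPERATORS[op](evaluate(left, rec_a, rec_b),
                         evaluate(right, rec_a, rec_b))


def is_replica(tree, rec_a, rec_b, boundary=1.0):
    """Classify a pair as replicas when the tree's value exceeds the fixed boundary."""
    return evaluate(tree, rec_a, rec_b) > boundary


# One hand-written candidate function: 2 * name_sim + city_sim.
candidate = ("+", ("*", 2.0, "name_sim"), "city_sim")

# Made-up records used only to exercise the sketch.
rec_1 = {"name": "John Smith", "city": "Manaus"}
rec_2 = {"name": "john smith", "city": "MANAUS"}
rec_3 = {"name": "Maria Souza", "city": "Belo Horizonte"}

print(is_replica(candidate, rec_1, rec_2))
print(is_replica(candidate, rec_1, rec_3))
```

Because the boundary stays fixed, any tuning happens in the shape of the tree itself (which evidence appears and how it is weighted), which is the kind of adaptation the abstract attributes to the evolved functions.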
Author da Silva, A. S.
Laender, A. H. F.
Goncalves, M. A.
de Carvalho, M. G.
Author_xml – sequence: 1
  givenname: M. G.
  surname: de Carvalho
  fullname: de Carvalho, M. G.
  email: ext-moises.carvalho@nokia.com
  organization: Nokia INdT, Manaus, Brazil
– sequence: 2
  givenname: A. H. F.
  surname: Laender
  fullname: Laender, A. H. F.
  email: laender@dcc.ufmg.br
  organization: Dept. de Cienc. da Comput., Univ. Fed. de Minas Gerais, Belo Horizonte, Brazil
– sequence: 3
  givenname: M. A.
  surname: Goncalves
  fullname: Goncalves, M. A.
  email: mgoncalv@dcc.ufmg.br
  organization: Dept. de Cienc. da Comput., Univ. Fed. de Minas Gerais, Belo Horizonte, Brazil
– sequence: 4
  givenname: A. S.
  surname: da Silva
  fullname: da Silva, A. S.
  email: alti@icomp.ufam.edu.br
  organization: Inst. de Comput., Univ. Fed. do Amazonas, Manaus, Brazil
CODEN ITKEEH
CitedBy_id crossref_primary_10_32604_cmc_2023_033995
crossref_primary_10_1109_TETCI_2019_2923426
crossref_primary_10_1136_amiajnl_2013_001744
crossref_primary_10_1007_s12065_019_00263_0
crossref_primary_10_1016_j_is_2012_10_002
crossref_primary_10_1016_j_aei_2025_103538
crossref_primary_10_1016_j_asoc_2015_11_015
crossref_primary_10_1016_j_eswa_2012_08_045
crossref_primary_10_1093_comjnl_bxv052
crossref_primary_10_3390_info13030116
crossref_primary_10_1007_s00521_019_04060_9
crossref_primary_10_1007_s11704_015_5276_6
crossref_primary_10_4018_IJSWIS_2018070107
crossref_primary_10_1007_s11042_022_12248_w
crossref_primary_10_1007_s10710_019_09361_5
crossref_primary_10_1016_j_amc_2013_02_017
crossref_primary_10_1080_0952813X_2020_1785018
crossref_primary_10_1109_ACCESS_2021_3139987
crossref_primary_10_1016_j_eswa_2013_09_040
crossref_primary_10_1016_j_websem_2013_06_001
crossref_primary_10_1007_s12559_020_09739_z
crossref_primary_10_1007_s10489_016_0874_z
crossref_primary_10_1016_j_is_2018_02_005
crossref_primary_10_1007_s00521_016_2379_4
crossref_primary_10_1016_j_is_2018_06_006
crossref_primary_10_1080_17517575_2015_1065513
crossref_primary_10_1016_j_swevo_2017_12_002
crossref_primary_10_1109_TKDE_2015_2416734
Cites_doi 10.1145/352595.352598
10.2307/2286061
10.1145/1148170.1148265
10.1016/S0306-4379(01)00042-4
10.1016/j.patcog.2008.04.010
10.1109/MIS.2003.1234765
10.1109/TKDE.2007.250581
10.1145/872757.872796
10.1145/1363686.1364118
10.1109/2.769447
10.1145/1277741.1277810
10.1145/775047.775087
10.1145/956699.956719
10.1016/B978-012088469-8.50057-7
10.7551/mitpress/1109.003.0004
10.1145/301136.301255
10.1126/science.130.3381.954
10.1007/11508069_15
10.1145/1142473.1142599
10.1145/956750.956759
10.1145/1141753.1141760
10.1016/j.is.2008.07.003
10.1145/1099554.1099688
10.1007/s00778-002-0072-y
10.1007/BFb0055923
10.1145/1008694.1008697
10.1145/775047.775116
10.1145/775047.775099
ContentType Journal Article
DBID 97E
RIA
RIE
AAYXX
CITATION
DOI 10.1109/TKDE.2010.234
DatabaseName IEEE Xplore (IEEE)
IEEE All-Society Periodicals Package (ASPP) 1998–Present
IEEE Xplore
CrossRef
DatabaseTitle CrossRef
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
Computer Science
EndPage 412
ExternalDocumentID 10_1109_TKDE_2010_234
5645623
Genre orig-research
IEDL.DBID RIE
ISICitedReferencesCount 45
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000299384600002&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
ISSN 1041-4347
IngestDate Tue Nov 18 22:35:39 EST 2025
Sat Nov 29 08:07:08 EST 2025
Wed Aug 27 02:52:17 EDT 2025
IsPeerReviewed true
IsScholarly true
Issue 3
Language English
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
LinkModel DirectLink
PageCount 14
ParticipantIDs crossref_citationtrail_10_1109_TKDE_2010_234
crossref_primary_10_1109_TKDE_2010_234
ieee_primary_5645623
PublicationCentury 2000
PublicationDate 2012-03-01
PublicationDateYYYYMMDD 2012-03-01
PublicationDate_xml – month: 03
  year: 2012
  text: 2012-03-01
  day: 01
PublicationDecade 2010
PublicationTitle IEEE transactions on knowledge and data engineering
PublicationTitleAbbrev TKDE
PublicationYear 2012
Publisher IEEE
Publisher_xml – name: IEEE
References_xml – ident: ref23
  doi: 10.1145/352595.352598
– ident: ref5
  doi: 10.2307/2286061
– year: 2006
  ident: ref7
  article-title: Is Your Data Dirty? And Does That Matter?
– ident: ref14
  doi: 10.1145/1148170.1148265
– ident: ref28
  doi: 10.1016/S0306-4379(01)00042-4
– ident: ref13
  doi: 10.1016/j.patcog.2008.04.010
– volume-title: Freely Extensible Biomedical Record Linkage
  year: 2011
  ident: ref26
– ident: ref17
  doi: 10.1109/MIS.2003.1234765
– ident: ref21
  doi: 10.1109/TKDE.2007.250581
– volume-title: Modern Information Retrieval
  year: 1999
  ident: ref22
– ident: ref3
  doi: 10.1145/872757.872796
– ident: ref16
  doi: 10.1145/1363686.1364118
– ident: ref20
  doi: 10.1109/2.769447
– volume-title: CIO Asia Magazine
  year: 2004
  ident: ref1
  article-title: Operation Clean Data
– ident: ref10
  doi: 10.1145/1277741.1277810
– ident: ref33
  doi: 10.1145/775047.775087
– ident: ref24
  doi: 10.1145/956699.956719
– ident: ref29
  doi: 10.1016/B978-012088469-8.50057-7
– ident: ref30
  doi: 10.7551/mitpress/1109.003.0004
– ident: ref19
  doi: 10.1145/301136.301255
– ident: ref25
  doi: 10.1126/science.130.3381.954
– ident: ref32
  doi: 10.1007/11508069_15
– ident: ref2
  doi: 10.1145/1142473.1142599
– ident: ref18
  doi: 10.1145/956750.956759
– ident: ref15
  doi: 10.1145/1141753.1141760
– ident: ref11
  doi: 10.1016/j.is.2008.07.003
– ident: ref12
  doi: 10.1145/1099554.1099688
– volume-title: Genetic Programming: On the Programming of Computers by Means of Natural Selection
  year: 1992
  ident: ref8
– ident: ref6
  doi: 10.1007/s00778-002-0072-y
– volume-title: Genetic Programming - An Introduction: On the Automatic Evolution of Computer Programs and Its Applications
  year: 1998
  ident: ref9
  doi: 10.1007/BFb0055923
– ident: ref4
  doi: 10.1145/1008694.1008697
– ident: ref27
  doi: 10.1145/775047.775116
– ident: ref31
  doi: 10.1145/775047.775099
SSID ssj0008781
Score 2.3144853
SourceID crossref
ieee
SourceType Enrichment Source
Index Database
Publisher
StartPage 399
SubjectTerms Data mining
Database administration
database integration
evolutionary computing and genetic algorithms
Genetic programming
Machine learning
Probabilistic logic
Training
Title A Genetic Programming Approach to Record Deduplication
URI https://ieeexplore.ieee.org/document/5645623
Volume 24
WOSCitedRecordID wos000299384600002&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
journalDatabaseRights – providerCode: PRVIEE
  databaseName: IEEE Electronic Library (IEL)
  issn: 1041-4347
  databaseCode: RIE
  dateStart: 19890101
  customDbUrl:
  isFulltext: true
  dateEnd: 99991231
  titleUrlDefault: https://ieeexplore.ieee.org/
  omitProxy: false
  ssIdentifier: ssj0008781
  providerName: IEEE