Unsupervised Entity Resolution With Blocking and Graph Algorithms

Entity resolution identifies all records in a database that refer to the same entity. In this paper, we propose an unsupervised framework for entity resolution using blocking and graph algorithms. The records are partitioned into blocks with no redundancy for efficiency improvement. For intra-block...

Bibliographic details
Published in: IEEE transactions on knowledge and data engineering, Volume 34; Issue 3; pp. 1501 - 1515
Main authors: Zhang, Dongxiang, Li, Dongsheng, Guo, Long, Tan, Kian-Lee
Format: Journal Article
Language: English
Publication details: New York IEEE 01.03.2022
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN:1041-4347, 1558-2191
Abstract Entity resolution identifies all records in a database that refer to the same entity. In this paper, we propose an unsupervised framework for entity resolution using blocking and graph algorithms. The records are partitioned into blocks with no redundancy to improve efficiency. For intra-block data processing, we propose a graph-theoretic fusion framework with two components, namely ITER and CliqueRank. Specifically, ITER constructs a weighted bipartite graph between terms and record-record pairs and iteratively propagates the node salience until convergence. Subsequently, CliqueRank constructs a record graph to estimate the likelihood that two records reside in the same clique. The likelihood derived from CliqueRank is fed back to ITER to rectify the edge weights until a joint optimum is reached. Experimental evaluation was conducted on four real datasets. Results show that our unsupervised framework is comparable to, or even better than, state-of-the-art deep learning approaches.
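The propagation idea in the abstract (a bipartite graph between terms and record pairs, with scores passed back and forth until they settle) can be illustrated with a small sketch. The abstract does not give ITER's update rules, so everything below is an assumption made for illustration: the function name `propagate_salience`, the whitespace tokenisation, the sum-then-normalise pair update, the averaging term update, and the fixed iteration count. CliqueRank and the feedback of its likelihoods into the edge weights are not modelled.

```python
from itertools import combinations


def propagate_salience(records, iterations=30):
    """Illustrative term <-> record-pair score propagation (not the paper's ITER rules)."""
    # Assumption: each record is a whitespace-tokenised, lower-cased set of terms.
    tokens = {rid: set(text.lower().split()) for rid, text in records.items()}

    # Pair nodes: only record pairs that share at least one term are connected.
    shared = {}
    for a, b in combinations(sorted(tokens), 2):
        common = tokens[a] & tokens[b]
        if common:
            shared[(a, b)] = common
    if not shared:
        return {}, {}

    term_score = {t: 1.0 for common in shared.values() for t in common}
    pair_score = {}
    for _ in range(iterations):
        # Pair update: sum the salience of the shared terms, then scale to [0, 1].
        raw = {p: sum(term_score[t] for t in common) for p, common in shared.items()}
        top = max(raw.values())
        pair_score = {p: s / top for p, s in raw.items()}

        # Term update: average score of the pairs that share the term.
        totals = {t: [0.0, 0] for t in term_score}
        for p, common in shared.items():
            for t in common:
                totals[t][0] += pair_score[p]
                totals[t][1] += 1
        term_score = {t: s / n for t, (s, n) in totals.items()}
    return pair_score, term_score
```

In this toy scheme, a term that appears in many weakly matching pairs ends up with lower salience than a term found only in strongly matching pairs, which in turn lowers the score of pairs that share nothing but the common term.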
Author Zhang, Dongxiang
Li, Dongsheng
Guo, Long
Tan, Kian-Lee
Author_xml – sequence: 1
  givenname: Dongxiang
  orcidid: 0000-0002-9964-2470
  surname: Zhang
  fullname: Zhang, Dongxiang
  email: zhangdongxiang37@gmail.com
  organization: College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, China
– sequence: 2
  givenname: Dongsheng
  orcidid: 0000-0001-9743-2034
  surname: Li
  fullname: Li, Dongsheng
  email: dsli@nudt.edu.cn
  organization: School of Computer Science, National University of Defense Technology, Changsha, Hunan, China
– sequence: 3
  givenname: Long
  surname: Guo
  fullname: Guo, Long
  email: leo.gl@alibaba-inc.com
  organization: Alibaba Group, Hangzhou, China
– sequence: 4
  givenname: Kian-Lee
  surname: Tan
  fullname: Tan, Kian-Lee
  email: tankl@comp.nus.edu.sg
  organization: School of Computing, National University of Singapore, Singapore
CODEN ITKEEH
CitedBy_id crossref_primary_10_1145_3626711
crossref_primary_10_1109_TKDE_2021_3060790
crossref_primary_10_1088_1742_6596_1651_1_012043
crossref_primary_10_1145_3447507
crossref_primary_10_1109_ACCESS_2025_3608236
crossref_primary_10_1109_TKDE_2021_3134806
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2022
Copyright_xml – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2022
DOI 10.1109/TKDE.2020.2991063
DatabaseName IEEE Xplore (IEEE)
IEEE All-Society Periodicals Package (ASPP) 1998–Present
IEEE Xplore
CrossRef
Computer and Information Systems Abstracts
Electronics & Communications Abstracts
Technology Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
DatabaseTitle CrossRef
Technology Research Database
Computer and Information Systems Abstracts – Academic
Electronics & Communications Abstracts
ProQuest Computer Science Collection
Computer and Information Systems Abstracts
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts Professional
DatabaseTitleList
Technology Research Database
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
Computer Science
EISSN 1558-2191
EndPage 1515
ExternalDocumentID 10_1109_TKDE_2020_2991063
9079896
Genre orig-research
GrantInformation_xml – fundername: MoE
  grantid: T1251RES1913
– fundername: National Natural Science Foundation of China
  grantid: 61932001; 61702016
  funderid: 10.13039/501100001809
IEDL.DBID RIE
ISICitedReferencesCount 8
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000752013800035&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
ISSN 1041-4347
IngestDate Sun Nov 09 08:04:46 EST 2025
Tue Nov 18 22:17:19 EST 2025
Sat Nov 29 02:36:02 EST 2025
Wed Aug 27 03:00:17 EDT 2025
IsPeerReviewed true
IsScholarly true
Issue 3
Language English
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
https://doi.org/10.15223/policy-029
https://doi.org/10.15223/policy-037
LinkModel DirectLink
Notes ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
ORCID 0000-0002-9964-2470
0000-0001-9743-2034
PQID 2625366948
PQPubID 85438
PageCount 15
ParticipantIDs crossref_primary_10_1109_TKDE_2020_2991063
proquest_journals_2625366948
crossref_citationtrail_10_1109_TKDE_2020_2991063
ieee_primary_9079896
PublicationCentury 2000
PublicationDate 2022-03-01
PublicationDateYYYYMMDD 2022-03-01
PublicationDate_xml – month: 03
  year: 2022
  text: 2022-03-01
  day: 01
PublicationDecade 2020
PublicationPlace New York
PublicationPlace_xml – name: New York
PublicationTitle IEEE transactions on knowledge and data engineering
PublicationTitleAbbrev TKDE
PublicationYear 2022
Publisher IEEE
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher_xml – name: IEEE
– name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
SSID ssj0008781
Score 2.4113863
SourceID proquest
crossref
ieee
SourceType Aggregation Database
Enrichment Source
Index Database
Publisher
StartPage 1501
SubjectTerms Algorithms
Bipartite graph
Blocking
Crowdsourcing
Data processing
Deep learning
Graph theory
graph-based algorithm
Machine learning
Measurement
Nuclear power plants
Redundancy
Task analysis
Training
Unsupervised entity resolution
Title Unsupervised Entity Resolution With Blocking and Graph Algorithms
URI https://ieeexplore.ieee.org/document/9079896
https://www.proquest.com/docview/2625366948
Volume 34
WOSCitedRecordID wos000752013800035&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
journalDatabaseRights – providerCode: PRVIEE
  databaseName: IEEE Electronic Library (IEL)
  customDbUrl:
  eissn: 1558-2191
  dateEnd: 99991231
  omitProxy: false
  ssIdentifier: ssj0008781
  issn: 1041-4347
  databaseCode: RIE
  dateStart: 19890101
  isFulltext: true
  titleUrlDefault: https://ieeexplore.ieee.org/
  providerName: IEEE