Unsupervised Entity Resolution With Blocking and Graph Algorithms
Entity resolution identifies all records in a database that refer to the same entity. In this paper, we propose an unsupervised framework for entity resolution using blocking and graph algorithms. The records are partitioned into non-redundant blocks to improve efficiency. For intra-block...
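The blocking step mentioned above can be illustrated with a minimal sketch. It is not the paper's method: the key function `blocking_key`, the helper names, and the sample records are all hypothetical, chosen only to show what partitioning "with no redundancy" means (each record lands in exactly one block) and why it reduces the number of record pairs that need to be compared.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: str) -> str:
    """Hypothetical blocking key: the lexicographically smallest lowercase token."""
    tokens = record.lower().split()
    return min(tokens) if tokens else ""


def partition_into_blocks(records):
    """Assign every record index to exactly one block, so blocks are disjoint."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[blocking_key(record)].append(idx)
    return dict(blocks)


def candidate_pairs(blocks):
    """Yield only the record-index pairs that share a block."""
    for members in blocks.values():
        yield from combinations(members, 2)


# Illustrative records, not taken from the paper's datasets.
records = [
    "apple iphone 12 black",
    "iphone 12 apple 64gb",
    "samsung galaxy s21",
    "galaxy s21 samsung phone",
]

blocks = partition_into_blocks(records)
pairs = sorted(candidate_pairs(blocks))
print(blocks)
print(pairs)
```

With four records there are six possible pairs overall; the disjoint blocks here leave only the pairs inside each block as candidates for the more expensive intra-block processing.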
| Published in: | IEEE Transactions on Knowledge and Data Engineering, Vol. 34, No. 3, pp. 1501-1515 |
|---|---|
| Main authors: | Zhang, Dongxiang; Li, Dongsheng; Guo, Long; Tan, Kian-Lee |
| Format: | Journal Article |
| Language: | English |
| Publication details: | New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 1 March 2022 |
| ISSN: | 1041-4347, 1558-2191 |
| Abstract | Entity resolution identifies all records in a database that refer to the same entity. In this paper, we propose an unsupervised framework for entity resolution using blocking and graph algorithms. The records are partitioned into non-redundant blocks to improve efficiency. For intra-block data processing, we propose a graph-theoretic fusion framework with two components, ITER and CliqueRank. Specifically, ITER constructs a weighted bipartite graph between terms and record-record pairs and iteratively propagates node salience until convergence. Subsequently, CliqueRank constructs a record graph to estimate the likelihood that two records reside in the same clique. The likelihood derived from CliqueRank is fed back to ITER to rectify the edge weights until a joint optimum is reached. Experimental evaluation was conducted on four real datasets. Results show that our unsupervised framework is comparable or even superior to state-of-the-art deep learning approaches. |
|---|---|
| Author | Zhang, Dongxiang; Li, Dongsheng; Guo, Long; Tan, Kian-Lee |
| Author_xml | 1. Dongxiang Zhang (ORCID: 0000-0002-9964-2470; zhangdongxiang37@gmail.com), College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, China. 2. Dongsheng Li (ORCID: 0000-0001-9743-2034; dsli@nudt.edu.cn), School of Computer Science, National University of Defense Technology, Changsha, Hunan, China. 3. Long Guo (leo.gl@alibaba-inc.com), Alibaba Group, Hangzhou, China. 4. Kian-Lee Tan (tankl@comp.nus.edu.sg), School of Computing, National University of Singapore, Singapore. |
| CODEN | ITKEEH |
| ContentType | Journal Article |
| Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2022 |
| DOI | 10.1109/TKDE.2020.2991063 |
| Discipline | Engineering; Computer Science |
| EISSN | 1558-2191 |
| EndPage | 1515 |
| Genre | orig-research |
| GrantInformation_xml | MoE (grant T1251RES1913); National Natural Science Foundation of China (grants 61932001 and 61702016; funder ID 10.13039/501100001809) |
| ISICitedReferencesCount | 8 |
| ISSN | 1041-4347 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 3 |
| Language | English |
| License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html https://doi.org/10.15223/policy-029 https://doi.org/10.15223/policy-037 |
| ORCID | 0000-0002-9964-2470 0000-0001-9743-2034 |
| PageCount | 15 |
| PublicationDate | 2022-03-01 |
| PublicationPlace | New York |
| PublicationTitle | IEEE Transactions on Knowledge and Data Engineering |
| PublicationTitleAbbrev | TKDE |
| PublicationYear | 2022 |
| Publisher | IEEE; The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| StartPage | 1501 |
| SubjectTerms | Algorithms; Bipartite graph; Blocking; Crowdsourcing; Data processing; Deep learning; Graph theory; graph-based algorithm; Machine learning; Measurement; Nuclear power plants; Redundancy; Task analysis; Training; Unsupervised entity resolution |
| Title | Unsupervised Entity Resolution With Blocking and Graph Algorithms |
| URI | https://ieeexplore.ieee.org/document/9079896 https://www.proquest.com/docview/2625366948 |
| Volume | 34 |