A Suite of Efficient Randomized Algorithms for Streaming Record Linkage
| Published in: | IEEE Transactions on Knowledge and Data Engineering, Vol. 36, No. 7, pp. 2803-2813 |
|---|---|
| Main authors: | Karapiperis, Dimitrios; Tjortjis, Christos; Verykios, Vassilios S. |
| Format: | Journal Article |
| Language: | English |
| Published: | New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 1 July 2024 |
| Subjects: | |
| ISSN: | 1041-4347, 1558-2191 |
| Online access: | Full text |
| Abstract | Organizations leverage massive volumes of information and new types of data to generate unprecedented insights and improve their outcomes. Correctly identifying duplicate records that represent the same entity, such as a user, customer, or patient, a process commonly known as record linkage, can improve service levels, accelerate sales, or elevate healthcare decision support. To this end, blocking methods group matching records into the same block, using a combination of their attributes as blocking keys. This paper introduces a suite of randomized algorithms specifically crafted for streaming record linkage settings. Using an in-memory data structure that is bounded in both the number of blocks and the number of positions within each block, our algorithms guarantee that the most frequently accessed and the most recently used blocks remain in main memory and, additionally, that the records within a block are renewed on a rolling basis. The algorithms rely on simple random choices instead of cumbersome sorting data structures; these choices ensure that the probability that inactive blocks and older records remain in main memory decays, freeing space for more promising blocks and fresher records, respectively. We also introduce an algorithm that performs approximate blocking to tackle misspellings and typos in the blocking keys. The experimental evaluation shows that the proposed algorithms scale efficiently to data streams while providing certain accuracy guarantees. |
|---|---|
| Author | Karapiperis, Dimitrios; Tjortjis, Christos; Verykios, Vassilios S. |
| Author_xml | 1. Karapiperis, Dimitrios (ORCID: 0000-0002-3878-5988; email: dkarapiperis@ihu.edu.gr), International Hellenic University, Thermi, Thessaloniki, Greece; 2. Tjortjis, Christos (ORCID: 0000-0001-8263-9024; email: c.tjortjis@ihu.edu.gr), International Hellenic University, Thermi, Thessaloniki, Greece; 3. Verykios, Vassilios S. (ORCID: 0000-0002-9758-0819; email: verykios@eap.gr), Hellenic Open University, Patras, Greece |
| CODEN | ITKEEH |
| CitedBy_id | crossref_primary_10_1016_j_is_2024_102506 crossref_primary_10_3390_axioms13110795 crossref_primary_10_1109_ACCESS_2025_3599990 |
| Cites_doi | 10.1109/TKDE.2014.2349916 10.1007/s10618-018-0563-0 10.1109/BigData50022.2020.9378127 10.1007/s10619-019-07263-0 10.1109/TKDE.2013.54 10.14778/3611479.3611487 10.14778/2733085.2733098 10.14778/1920841.1920898 10.1145/2740908.2742839 10.1109/ICDE51399.2021.00112 10.1109/ICDE.2018.00015 10.1109/BigData52589.2021.9671540 10.14778/2876473.2876474 10.1145/276698.276781 10.1109/TKDE.2014.2359666 10.14778/2732939.2732943 10.1109/TKDE.2012.43 10.1109/TKDE.2011.127 10.14778/2556549.2556567 10.1145/3448016.3457238 10.1145/3341105.3375776 |
| ContentType | Journal Article |
| Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2024 |
| DOI | 10.1109/TKDE.2024.3361022 |
| Discipline | Engineering Computer Science |
| EISSN | 1558-2191 |
| EndPage | 2813 |
| ExternalDocumentID | 10_1109_TKDE_2024_3361022 10418556 |
| Genre | orig-research |
| ISICitedReferencesCount | 4 |
| ISSN | 1041-4347 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 7 |
| Language | English |
| License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html https://doi.org/10.15223/policy-029 https://doi.org/10.15223/policy-037 |
| ORCID | 0000-0001-8263-9024 0000-0002-9758-0819 0000-0002-3878-5988 |
| PageCount | 11 |
| PublicationCentury | 2000 |
| PublicationDate | 2024-07-01 |
| PublicationDecade | 2020 |
| PublicationPlace | New York |
| PublicationPlace_xml | – name: New York |
| PublicationTitle | IEEE transactions on knowledge and data engineering |
| PublicationTitleAbbrev | TKDE |
| PublicationYear | 2024 |
| Publisher | IEEE The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| Publisher_xml | – name: IEEE – name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| StartPage | 2803 |
| SubjectTerms | Algorithms; Approximation algorithms; Blocking; Couplings; Data integration; Data structures; Data transmission; Entity resolution; Indexes; Randomization; Real-time systems; Record linkage; Soft sensors; Streams |
| Title | A Suite of Efficient Randomized Algorithms for Streaming Record Linkage |
| URI | https://ieeexplore.ieee.org/document/10418556 https://www.proquest.com/docview/3064700892 |
| Volume | 36 |
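
The abstract describes a bounded in-memory structure in which random choices, rather than sorted bookkeeping, decide which blocks and records leave memory. The sketch below is one plausible reading of that idea and is not the authors' algorithm: the class name `BoundedBlockStore`, its parameters, the "sample two blocks and evict the less-accessed one" rule, and the uniform random slot overwrite are all assumptions made for illustration.

```python
import random


class BoundedBlockStore:
    """Illustrative bounded blocking structure (hypothetical, not the paper's)."""

    def __init__(self, max_blocks, max_records_per_block, seed=None):
        self.max_blocks = max_blocks
        self.max_records = max_records_per_block
        self.blocks = {}  # blocking key -> list of records held in memory
        self.hits = {}    # blocking key -> number of accesses so far
        self.rng = random.Random(seed)

    def _evict_block(self):
        # Assumed policy: sample two resident keys at random and drop the
        # one with fewer accesses, so inactive blocks tend to leave first.
        # (Copying the keys is O(n); acceptable for a sketch only.)
        keys = list(self.blocks)
        a, b = self.rng.choice(keys), self.rng.choice(keys)
        victim = a if self.hits[a] <= self.hits[b] else b
        del self.blocks[victim]
        del self.hits[victim]

    def insert(self, key, record):
        """Store a record and return the candidates it should be compared to."""
        if key not in self.blocks:
            if len(self.blocks) >= self.max_blocks:
                self._evict_block()
            self.blocks[key] = []
            self.hits[key] = 0
        self.hits[key] += 1
        block = self.blocks[key]
        candidates = list(block)
        if len(block) < self.max_records:
            block.append(record)
        else:
            # Overwrite a uniformly random slot. A stored record survives
            # each overwrite with probability 1 - 1/max_records, so the
            # chance that an old record is still present decays geometrically.
            block[self.rng.randrange(self.max_records)] = record
        return candidates
```

Under these assumptions, a caller would derive a blocking key from each incoming record (for example, a prefix of a surname field, which is itself a hypothetical choice) and compare the record only against the candidates returned by `insert`. Memory stays bounded by `max_blocks * max_records_per_block` records regardless of stream length.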