A Suite of Efficient Randomized Algorithms for Streaming Record Linkage

Bibliographic Details
Published in: IEEE Transactions on Knowledge and Data Engineering, Vol. 36, No. 7, pp. 2803-2813
Main Authors: Karapiperis, Dimitrios; Tjortjis, Christos; Verykios, Vassilios S.
Format: Journal Article
Language: English
Published: New York, IEEE, 1 July 2024
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 1041-4347, 1558-2191
Online Access: Full text
Abstract Organizations leverage massive volumes of information and new types of data to generate unprecedented insights and improve their outcomes. Correctly identifying duplicate records that represent the same entity, such as a user, customer, or patient, a process commonly known as record linkage, can improve service levels, accelerate sales, or elevate healthcare decision support. To this end, blocking methods aim to group matching records in the same block, using a combination of their attributes as blocking keys. This paper introduces a suite of randomized algorithms specifically crafted for streaming record linkage settings. Using an in-memory data structure that is bounded in both the number of blocks and the number of positions within each block, our algorithms guarantee that the most frequently accessed and the most recently used blocks remain in main memory and, additionally, that the records within a block are renewed on a rolling basis. The algorithms rely on simple random choices instead of cumbersome sorting data structures; these choices ensure that the probability of inactive blocks and older records remaining in main memory decays, freeing space for more promising blocks and fresher records, respectively. We also introduce an algorithm that performs approximate blocking to tackle misspellings and typos in the blocking keys. The experimental evaluation shows that our proposed algorithms scale efficiently to data streams while providing certain accuracy guarantees.
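The abstract gives no pseudocode, so the following is only a minimal sketch of the general idea it describes: a blocking structure bounded in both the number of blocks and the slots per block, maintained by random choices rather than sorted structures. It is not the authors' algorithm. The class name `BoundedBlockStore`, its parameters, and the specific eviction rule (sample two resident blocks, drop the less-accessed one) are all assumptions made for illustration.

```python
import random


class BoundedBlockStore:
    """Toy bounded blocking structure with randomized eviction.

    Illustrative only: the eviction and replacement rules are assumptions,
    not the policies defined in the paper.
    """

    def __init__(self, max_blocks, block_capacity, seed=None):
        self.max_blocks = max_blocks          # upper bound on resident blocks
        self.block_capacity = block_capacity  # upper bound on records per block
        self.blocks = {}                      # blocking key -> list of records
        self.hits = {}                        # blocking key -> access count
        self.rng = random.Random(seed)

    def _evict_one_block(self):
        # Sample two resident keys at random and drop the less-accessed one,
        # so rarely used blocks are more likely to leave memory over time.
        keys = list(self.blocks)
        a, b = self.rng.choice(keys), self.rng.choice(keys)
        victim = a if self.hits[a] <= self.hits[b] else b
        del self.blocks[victim]
        del self.hits[victim]

    def insert(self, key, record):
        """Store `record` under `key`; return the records to compare it with."""
        if key not in self.blocks:
            if len(self.blocks) >= self.max_blocks:
                self._evict_one_block()
            self.blocks[key] = []
            self.hits[key] = 0
        block = self.blocks[key]
        candidates = list(block)  # records already sharing this blocking key
        self.hits[key] += 1
        if len(block) < self.block_capacity:
            block.append(record)
        else:
            # Overwrite a uniformly random slot: a given older record survives
            # each overwrite with probability 1 - 1/block_capacity.
            block[self.rng.randrange(self.block_capacity)] = record
        return candidates
```

With `store = BoundedBlockStore(max_blocks=2, block_capacity=2, seed=0)`, the call `store.insert("smith|1980", "r1")` returns `[]` and a following `store.insert("smith|1980", "r2")` returns `["r1"]`. Because a full block overwrites a random slot, the chance that a particular record is still resident after k further overwrites is (1 - 1/block_capacity)^k, which is one simple way to obtain the decaying retention probability the abstract mentions without any sorting.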
Author Karapiperis, Dimitrios
Tjortjis, Christos
Verykios, Vassilios S.
Author_xml – sequence: 1
  givenname: Dimitrios
  orcidid: 0000-0002-3878-5988
  surname: Karapiperis
  fullname: Karapiperis, Dimitrios
  email: dkarapiperis@ihu.edu.gr
  organization: International Hellenic University, Thermi, Thessaloniki, Greece
– sequence: 2
  givenname: Christos
  orcidid: 0000-0001-8263-9024
  surname: Tjortjis
  fullname: Tjortjis, Christos
  email: c.tjortjis@ihu.edu.gr
  organization: International Hellenic University, Thermi, Thessaloniki, Greece
– sequence: 3
  givenname: Vassilios S.
  orcidid: 0000-0002-9758-0819
  surname: Verykios
  fullname: Verykios, Vassilios S.
  email: verykios@eap.gr
  organization: Hellenic Open University, Patras, Greece
CODEN ITKEEH
CitedBy_id crossref_primary_10_1016_j_is_2024_102506
crossref_primary_10_3390_axioms13110795
crossref_primary_10_1109_ACCESS_2025_3599990
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2024
DOI 10.1109/TKDE.2024.3361022
Discipline Engineering
Computer Science
EISSN 1558-2191
EndPage 2813
ExternalDocumentID 10_1109_TKDE_2024_3361022
10418556
Genre orig-research
ISICitedReferencesCount 4
ISSN 1041-4347
IsPeerReviewed true
IsScholarly true
Issue 7
Language English
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
https://doi.org/10.15223/policy-029
https://doi.org/10.15223/policy-037
ORCID 0000-0001-8263-9024
0000-0002-9758-0819
0000-0002-3878-5988
PageCount 11
PublicationCentury 2000
PublicationDate 2024-07-01
PublicationDecade 2020
PublicationPlace New York
PublicationTitle IEEE transactions on knowledge and data engineering
PublicationTitleAbbrev TKDE
PublicationYear 2024
Publisher IEEE
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
StartPage 2803
SubjectTerms Algorithms
Approximation algorithms
Blocking
Couplings
Data integration
Data structures
Data transmission
entity resolution
Indexes
randomization
Real-time systems
record linkage
Soft sensors
Streams
Title A Suite of Efficient Randomized Algorithms for Streaming Record Linkage
URI https://ieeexplore.ieee.org/document/10418556
https://www.proquest.com/docview/3064700892
Volume 36