A Suite of Efficient Randomized Algorithms for Streaming Record Linkage

Bibliographic Details
Published in: IEEE Transactions on Knowledge and Data Engineering, Vol. 36, no. 7, pp. 2803-2813
Main Authors: Karapiperis, Dimitrios, Tjortjis, Christos, Verykios, Vassilios S.
Format: Journal Article
Language:English
Published: New York: IEEE, 01.07.2024
ISSN:1041-4347, 1558-2191
Description
Summary: Organizations leverage massive volumes of information and new types of data to generate unprecedented insights and improve their outcomes. Correctly identifying duplicate records that represent the same entity, such as a user, customer, or patient, a process commonly known as record linkage, can improve service levels, accelerate sales, or elevate healthcare decision support. To this end, blocking methods are used with the aim of grouping matching records in the same block, using a combination of their attributes as blocking keys. This paper introduces a suite of randomized algorithms specifically crafted for streaming record linkage settings. Using an in-memory data structure that is bounded in both the number of blocks and the positions within each block, our algorithms guarantee that the most frequently accessed and most recently used blocks remain in main memory, while the records within a block are renewed on a rolling basis. The operation of our algorithms relies on simple random choices, instead of cumbersome sorting data structures, so that the probability of inactive blocks and older records remaining in main memory decays over time, freeing space for more promising blocks and fresher records, respectively. We also introduce an algorithm that performs approximate blocking to tackle misspellings and typos in the blocking keys. The experimental evaluation shows that our proposed algorithms scale efficiently to data streams while providing certain accuracy guarantees.
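The bounded blocking structure described in the abstract can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's actual algorithms: it shows only the general idea of a block store that is bounded in both the number of blocks and the positions per block, using uniform random choices for eviction instead of sorted or LRU bookkeeping. The class name `BoundedBlockStore` and its parameters are hypothetical.

```python
import random

class BoundedBlockStore:
    """Illustrative sketch (not the paper's exact method): a bounded
    in-memory blocking structure using simple random choices for
    eviction instead of sorting data structures."""

    def __init__(self, max_blocks, block_capacity, seed=None):
        self.max_blocks = max_blocks            # bound on number of blocks
        self.block_capacity = block_capacity    # bound on positions per block
        self.blocks = {}                        # blocking key -> list of records
        self.rng = random.Random(seed)

    def insert(self, key, record):
        """Insert a streaming record under its blocking key and return
        the candidate matches currently sharing that block."""
        if key not in self.blocks:
            if len(self.blocks) >= self.max_blocks:
                # Evict a uniformly random block: a block that stays
                # inactive survives repeated evictions only with
                # decaying probability, freeing space for active blocks.
                victim = self.rng.choice(list(self.blocks))
                del self.blocks[victim]
            self.blocks[key] = []
        block = self.blocks[key]
        if len(block) < self.block_capacity:
            block.append(record)
        else:
            # Overwrite a random position so records are renewed on a
            # rolling basis and older records decay out over time.
            block[self.rng.randrange(self.block_capacity)] = record
        return [r for r in block if r is not record]
```

Because every eviction decision is a single uniform draw, each insertion costs O(1) bookkeeping regardless of stream length, which is the appeal of randomized replacement over maintaining recency-sorted structures.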
DOI:10.1109/TKDE.2024.3361022