A Scalable Parallel Processing Design for the Data Washing Machine: An Unsupervised Entity Resolution System

Saved in:
Bibliographic Details
Title: A Scalable Parallel Processing Design for the Data Washing Machine: An Unsupervised Entity Resolution System
Language: English
Author: Nicholas Kofi Akortia Hagan
Source: ProQuest LLC, 2024. Ph.D. Dissertation, University of Arkansas at Little Rock.
Availability: ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://www.proquest.com/en-US/products/dissertations/individuals.shtml
Peer Reviewed: N
Page Count: 112
Publication Date: 2024
Sponsoring Agency: National Science Foundation (NSF), Office of Integrative Activities (OIA)
Contract Number: 1946391
Publication Type: Dissertations/Theses - Doctoral Dissertations
Descriptors: Information Systems, Multivariate Analysis, Information Management, Computer System Design, Information Technology
ISBN: 979-83-8353-080-1
Abstract: Entity Resolution (ER) is one of the bedrocks of information system creation, ensuring that ambiguous entities are identified and resolved through linking. A common design approach of traditional ER systems is to run in single-threaded mode, which makes them prone to out-of-memory errors when processing larger datasets. The Data Washing Machine (DWM), a proof-of-concept unsupervised cluster ER system, is not exempt from this common design bottleneck. The original prototype design of the DWM requires shared memory tables and dictionaries of tokens, and its single-threaded nature makes it unscalable and therefore not viable for real-world application. Distributed and parallel programming frameworks such as Hadoop MapReduce (MR) and Apache Spark's Resilient Distributed Datasets (RDD) are a good fit for scaling ER systems because the comparison of equivalent pairs is independent and can occur in parallel. This dissertation aims to design and develop a Distributed DWM by adopting the parallel and distributed capabilities of Hadoop MR and RDD. An initial prototype (HadoopDWM) was developed using Hadoop MR and was subsequently refactored into SparkDWM using RDD. Experimental results show that HadoopDWM and SparkDWM produce the same results as the legacy DWM when using optimal starting parameters. A scalability test conducted with 203 million records confirms that HadoopDWM and SparkDWM are scalable, with total execution times of 7 and 3 hours, respectively. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone (1-800-521-0600) or via the Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]
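
Note: The abstract's central scaling argument is that pairwise record comparisons are independent, which is what allows frameworks such as Spark RDD to distribute them across a cluster. The sketch below is not taken from the dissertation; it is a minimal, hypothetical PySpark illustration of that idea, assuming a simple token-blocking step and an illustrative Jaccard similarity with an arbitrary threshold.

# Hypothetical sketch only: illustrates distributing independent pairwise
# comparisons with Spark RDD transformations; it is not the SparkDWM code.
from itertools import combinations
from pyspark import SparkContext

sc = SparkContext(appName="rdd-er-sketch")

# Toy records as (record_id, text) pairs.
records = sc.parallelize([
    (1, "john a smith little rock ar"),
    (2, "jon smith little rock arkansas"),
    (3, "mary jones fayetteville ar"),
])

# Blocking: emit (token, record) so only records sharing a token are compared.
token_to_record = records.flatMap(
    lambda rec: [(tok, rec) for tok in rec[1].split()]
)

# Candidate pairs: all record pairs within each token block; each pair is an
# independent unit of work, so Spark can score them in parallel.
candidate_pairs = (
    token_to_record.groupByKey()
    .flatMap(lambda kv: combinations(sorted(kv[1]), 2))
    .distinct()
)

def jaccard(a, b):
    # Token-set Jaccard similarity between two record strings.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

# Score every candidate pair and keep likely matches (threshold is illustrative).
matches = candidate_pairs.map(
    lambda pair: (pair[0][0], pair[1][0], jaccard(pair[0][1], pair[1][1]))
).filter(lambda triple: triple[2] >= 0.5)

print(matches.collect())
sc.stop()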
Abstractor: As Provided
Entry Date: 2024
Access URL: https://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqm&rft_dat=xri:pqdiss:31484459
Document Code: ED659087
Database: ERIC