A Scalable Parallel Processing Design for the Data Washing Machine: An Unsupervised Entity Resolution System

Bibliographic Details
Title: A Scalable Parallel Processing Design for the Data Washing Machine: An Unsupervised Entity Resolution System
Language: English
Authors: Nicholas Kofi Akortia Hagan
Source: ProQuest LLC. 2024. Ph.D. Dissertation, University of Arkansas at Little Rock.
Availability: ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://www.proquest.com/en-US/products/dissertations/individuals.shtml
Peer Reviewed: N
Page Count: 112
Publication Date: 2024
Sponsoring Agency: National Science Foundation (NSF), Office of Integrative Activities (OIA)
Contract Number: 1946391
Document Type: Dissertations/Theses - Doctoral Dissertations
Descriptors: Information Systems, Multivariate Analysis, Information Management, Computer System Design, Information Technology
ISBN: 979-83-8353-080-1
Abstract: Entity Resolution (ER) has been one of the bedrocks in the creation of information systems by ensuring that ambiguous entities are identified and resolved by linking. One common design approach of traditional ER systems is to run in single-threaded mode, which makes the system prone to out-of-memory errors when processing larger datasets. The Data Washing Machine (DWM), a proof-of-concept unsupervised cluster ER system, is no exception to this common design bottleneck. The original prototype design of the DWM requires shared memory tables and dictionaries of tokens, and its single-threaded nature makes it unscalable and hence not viable for real-world applications. Distributed and parallel programming frameworks such as Hadoop MapReduce (MR) and Apache Spark's Resilient Distributed Datasets (RDD) are well suited to scaling ER systems, since the comparisons of equivalent pairs are independent of one another and can occur in parallel. This dissertation aims to design and develop a Distributed DWM by adopting the parallel and distributed capabilities of Hadoop MR and RDD. An initial prototype (HadoopDWM) was developed using Hadoop MR and was then refactored into SparkDWM using RDD. Experimental results show that HadoopDWM and SparkDWM produce the same results as the legacy DWM when using optimal starting parameters. A scalability test conducted using 203 million records confirms that HadoopDWM and SparkDWM are scalable, with total execution times of 7 and 3 hours, respectively. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone at 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]
Abstractor: As Provided
Entry Date: 2024
Access URL: https://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqm&rft_dat=xri:pqdiss:31484459
Accession Number: ED659087
Database: ERIC