A Scalable Parallel Processing Design for the Data Washing Machine: An Unsupervised Entity Resolution System
| Title: | A Scalable Parallel Processing Design for the Data Washing Machine: An Unsupervised Entity Resolution System |
|---|---|
| Language: | English |
| Authors: | Nicholas Kofi Akortia Hagan |
| Source: | ProQuest LLC. 2024. Ph.D. Dissertation, University of Arkansas at Little Rock. |
| Availability: | ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://www.proquest.com/en-US/products/dissertations/individuals.shtml |
| Peer Reviewed: | N |
| Page Count: | 112 |
| Publication Date: | 2024 |
| Sponsoring Agency: | National Science Foundation (NSF), Office of Integrative Activities (OIA) |
| Contract Number: | 1946391 |
| Document Type: | Dissertations/Theses - Doctoral Dissertations |
| Descriptors: | Information Systems, Multivariate Analysis, Information Management, Computer System Design, Information Technology |
| ISBN: | 979-83-8353-080-1 |
| Abstract: | Entity Resolution (ER) has been one of the bedrocks in the creation of information systems by ensuring that ambiguous entities are identified and resolved by linking. One common design approach of traditional ER systems is to run in single-threaded mode, which makes the system prone to out-of-memory errors when processing larger datasets. The Data Washing Machine (DWM), a proof-of-concept unsupervised cluster ER system, is no exception to this common design bottleneck. The original prototype design of the DWM requires shared memory tables and dictionaries of tokens, and its single-threaded nature makes it not scalable, hence not viable for real-world applications. Distributed and parallel programming frameworks such as Hadoop MapReduce (MR) and Apache Spark's Resilient Distributed Datasets (RDD) are well suited to scaling ER systems, since the comparison of equivalent pairs is independent and can occur in parallel. This dissertation aims to design and develop a Distributed DWM by adopting the parallel and distributed capabilities of Hadoop MR and RDD. An initial prototype (HadoopDWM) was developed using Hadoop MR and was then refactored into SparkDWM using RDD. Experimental results show that HadoopDWM and SparkDWM produce the same results as the legacy DWM when using optimal starting parameters. A scalability test conducted using 203 million records confirms that HadoopDWM and SparkDWM are scalable, with total execution times of 7 and 3 hours, respectively. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone at 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.] |
| Abstractor: | As Provided |
| Entry Date: | 2024 |
| Access URL: | https://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqm&rft_dat=xri:pqdiss:31484459 |
| Accession Number: | ED659087 |
| Database: | ERIC |