SparkDQ: Efficient generic big data quality management on distributed data-parallel computation

Bibliographic details
Published in:	Journal of Parallel and Distributed Computing, Vol. 156, pp. 132-147
Main authors:	Gu, Rong; Qi, Yang; Wu, Tongyu; Wang, Zhaokang; Xu, Xiaolong; Yuan, Chunfeng; Huang, Yihua
Format:	Journal Article
Language:	English
Published:	Elsevier Inc, 01.10.2021
Subjects:	Big data; Data quality management system; Distributed system; Multi-tasks scheduling; Parallel data quality algorithms
ISSN:	0743-7315, 1096-0848

Description
Highlights:
• A generic big data quality model and programming framework.
• A set of distributed data-parallel quality management algorithms.
• A Spark-based implementation with two optimization techniques.
• Comprehensive performance evaluation of parallel data quality algorithms.
Summary:	In the big data era, large amounts of data are being generated and accumulated across industries. However, users are often hindered by data quality issues when extracting value from big data, so these issues are receiving growing attention from data quality management analysts. State-of-the-art solutions such as data ETL, data cleaning, and data quality monitoring systems have many deficiencies in capability and efficiency, which makes it difficult for them to cope with complicated situations on big data. These problems inspire us to build SparkDQ, a generic distributed data quality management model and framework that provides a series of data quality detection and repair interfaces. Users can quickly build custom data quality computing tasks for various needs by using these interfaces. In addition, SparkDQ implements a set of algorithms that run in parallel with optimizations and target various data quality goals. We also propose several system-level optimizations, including job-level optimization with multi-task execution scheduling and data-level optimization with data state caching. The experimental evaluation shows that the proposed distributed algorithms in SparkDQ run up to 12 times faster than the corresponding stand-alone serial and multi-thread algorithms. Compared with the state-of-the-art distributed data quality solution Apache Griffin, SparkDQ has more features, and its execution time is on average only about half that of Apache Griffin. SparkDQ achieves near-linear data and node scalability.
DOI:	10.1016/j.jpdc.2021.05.012
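
Illustrative note: the summary mentions data quality detection and repair interfaces from which users build custom tasks. The record does not show the SparkDQ API, so the following is only a minimal single-machine sketch of the general detect-then-repair pattern in plain Python. Every name in it (`QualityRule`, `detect`, `repair`, the sample rows) is a hypothetical chosen for illustration, and it does not use Spark or reproduce the paper's algorithms.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[str]]


@dataclass
class QualityRule:
    """Hypothetical rule: a validity check plus a repair for one column."""
    column: str
    is_valid: Callable[[Optional[str]], bool]
    repair: Callable[[Optional[str]], Optional[str]]


def detect(rows: List[Row], rule: QualityRule) -> List[int]:
    """Return the indices of rows whose value in rule.column fails the check."""
    return [i for i, row in enumerate(rows)
            if not rule.is_valid(row.get(rule.column))]


def repair(rows: List[Row], rule: QualityRule) -> List[Row]:
    """Return new rows in which failing values are replaced by rule.repair."""
    fixed = []
    for row in rows:
        value = row.get(rule.column)
        if rule.is_valid(value):
            fixed.append(dict(row))  # copy so the input is left untouched
        else:
            fixed.append({**row, rule.column: rule.repair(value)})
    return fixed


# Made-up sample data: two of the three rows have an unusable "city" value.
rows = [
    {"id": "1", "city": "Paris"},
    {"id": "2", "city": None},
    {"id": "3", "city": ""},
]

# Completeness-style rule (an assumption): the value must be non-empty.
not_empty = QualityRule(
    column="city",
    is_valid=lambda v: v is not None and v != "",
    repair=lambda v: "UNKNOWN",
)

bad = detect(rows, not_empty)      # indices of rows that violate the rule
cleaned = repair(rows, not_empty)  # repaired copy of the data
```

In a distributed setting the same per-row check would typically be applied partition by partition; how SparkDQ actually schedules and caches such tasks is described in the article itself, not in this sketch.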