SparkDQ: Efficient generic big data quality management on distributed data-parallel computation
| Published in: | Journal of Parallel and Distributed Computing, Vol. 156, pp. 132-147 |
|---|---|
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier Inc, 1 October 2021 |
| ISSN: | 0743-7315, 1096-0848 |

Summary:

- A generic big data quality model and programming framework.
- A set of distributed data-parallel quality management algorithms.
- A Spark-based implementation with two optimization techniques.
- Comprehensive performance evaluation of parallel data quality algorithms.

In the big data era, large amounts of data are being generated and accumulated across industries. However, data quality issues often hinder users trying to extract value from big data, and these issues are drawing growing attention from data quality management analysts. Cutting-edge solutions such as data ETL, data cleaning, and data quality monitoring systems have many deficiencies in capability and efficiency, which makes it difficult for them to cope with complicated situations on big data. These problems inspire us to build SparkDQ, a generic distributed data quality management model and framework that provides a series of data quality detection and repair interfaces. Users can quickly build custom data quality computing tasks for various needs by using these interfaces. In addition, SparkDQ implements a set of algorithms that run in parallel with optimizations and target various data quality goals. We also propose several system-level optimizations, including job-level optimization with multi-task execution scheduling and data-level optimization with data state caching. The experimental evaluation shows that the proposed distributed algorithms in SparkDQ run up to 12 times faster than the corresponding stand-alone serial and multi-threaded algorithms. Compared with the cutting-edge distributed data quality solution Apache Griffin, SparkDQ has more features, and its execution time is only around half of Apache Griffin's on average. SparkDQ achieves near-linear data and node scalability.

DOI: 10.1016/j.jpdc.2021.05.012
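
The detect-then-repair workflow described in the summary can be illustrated with a minimal sketch. It is plain Python, not Spark, and it does not reproduce SparkDQ's actual interfaces: the names `PartialStats`, `scan_partition`, `merge`, `completeness`, and `repair_missing` are hypothetical, and the choice of completeness (the share of non-missing values) as the quality metric is an assumption made for illustration. The sketch only shows the general data-parallel pattern: each partition is summarized independently, and the partial summaries are merged into one result.

```python
from dataclasses import dataclass, field
from functools import reduce


@dataclass
class PartialStats:
    """Mergeable per-partition summary (hypothetical, not SparkDQ's API)."""
    rows: int = 0
    missing: dict = field(default_factory=dict)  # column -> count of None values


def scan_partition(partition, columns):
    """Map step: a single pass over one partition counts missing values."""
    stats = PartialStats(missing={c: 0 for c in columns})
    for record in partition:
        stats.rows += 1
        for c in columns:
            if record.get(c) is None:
                stats.missing[c] += 1
    return stats


def merge(a, b):
    """Reduce step: combine two partial summaries into one."""
    cols = set(a.missing) | set(b.missing)
    missing = {c: a.missing.get(c, 0) + b.missing.get(c, 0) for c in cols}
    return PartialStats(rows=a.rows + b.rows, missing=missing)


def completeness(partitions, columns):
    """Detection: fraction of non-missing values per column."""
    empty = PartialStats(missing={c: 0 for c in columns})
    total = reduce(merge, (scan_partition(p, columns) for p in partitions), empty)
    if total.rows == 0:
        return {c: 1.0 for c in columns}
    return {c: 1 - total.missing[c] / total.rows for c in columns}


def repair_missing(partition, defaults):
    """Repair: fill missing values with caller-supplied defaults."""
    return [
        {**r, **{c: v for c, v in defaults.items() if r.get(c) is None}}
        for r in partition
    ]


partitions = [
    [{"id": 1, "city": "Oslo"}, {"id": 2, "city": None}],
    [{"id": 3, "city": None}, {"id": 4, "city": "Rome"}],
]
before = completeness(partitions, ["id", "city"])
repaired = [repair_missing(p, {"city": "unknown"}) for p in partitions]
after = completeness(repaired, ["id", "city"])
```

Because `merge` is associative, the per-partition scans could run on separate workers in any order, which is the property a distributed engine relies on when it evaluates such a check.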