Comparative Analysis of Skew-Join Strategies for Large-Scale Datasets with MapReduce and Spark

Bibliographic details
Published in:	Applied Sciences, Vol. 12, No. 13, p. 6554
Main authors:	Phan, Anh-Cang; Phan, Thuong-Cang; Cao, Hung-Phi; Trieu, Thanh-Ngoan
Format:	Journal Article
Language:	English
Publisher:	Basel, MDPI AG (Multidisciplinary Digital Publishing Institute), 01.07.2022
Subjects:	Algorithms; Apache Spark; Big Data; big data analytics; Data processing; Datasets; Distributed processing; Engineering Sciences; MapReduce; Queries; skew join
ISSN:	2076-3417

Description
Summary:	In the era of data deluge, Big Data offers numerous opportunities but also poses significant challenges to conventional data processing and analysis methods. MapReduce has become a prominent parallel and distributed programming model for efficiently handling such massive datasets. One of the most fundamental and widely used operations in MapReduce is the join. Joins become more complex and expensive on skewed data, in which some join keys appear far more frequently than others. The reduce tasks that process these frequent keys finish later than the others, so much of the benefit of parallel computation is lost. Some studies have addressed the skew-join problem, but an adequate and systematic comparison in the Spark environment has not been presented. Existing work provides only experimental tests, so mathematical models on which skew-join algorithms can be compared are still lacking. This study therefore aims to provide a theoretical and practical basis for evaluating skew-join strategies for large-scale datasets with MapReduce and Spark, both analytically with cost models and practically with experiments. The objectives are, first, to present the implementation of prominent skew-join algorithms in Spark; second, to evaluate the algorithms using cost models and experiments; and third, to show the advantages and disadvantages of each and to recommend strategies for the better use of skew joins in Spark.
DOI:	10.3390/app12136554
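
The skew problem described in the summary can be illustrated with a minimal sketch. It is not taken from the paper: the key distribution, the `key % num_reducers` routing rule, and the function names `reducer_loads` and `imbalance` are all assumptions chosen for illustration. It only shows why one frequent join key leaves a single reduce task with most of the work.

```python
def reducer_loads(join_keys, num_reducers):
    """Count records per reducer when each key is routed by key % num_reducers."""
    loads = [0] * num_reducers
    for key in join_keys:
        loads[key % num_reducers] += 1
    return loads


def imbalance(loads):
    """Ratio of the busiest reducer's load to the mean load (1.0 = perfectly even)."""
    return max(loads) / (sum(loads) / len(loads))


# Hypothetical data: key 0 is "hot" (900 records); keys 1..100 appear once each.
skewed_keys = [0] * 900 + list(range(1, 101))
# Same record count, spread evenly over 1000 distinct keys.
uniform_keys = list(range(1000))

skewed_loads = reducer_loads(skewed_keys, num_reducers=4)
uniform_loads = reducer_loads(uniform_keys, num_reducers=4)

print(skewed_loads, imbalance(skewed_loads))    # [925, 25, 25, 25] 3.7
print(uniform_loads, imbalance(uniform_loads))  # [250, 250, 250, 250] 1.0
```

In this toy setting the job finishes only when the reducer holding 925 records finishes, so the other three reducers sit mostly idle. The skew-join strategies compared in the article address this kind of imbalance.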