Handling data skew in join algorithms using MapReduce
| Published in: | Expert Systems with Applications, Vol. 51, pp. 286-299 |
|---|---|
| Main authors: | , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier Ltd, 1 June 2016 |
| ISSN: | 0957-4174, 1873-6793 |
| Summary: | • We introduce a skew handling algorithm called multi-dimensional range partitioning. • The proposed algorithm is more efficient than traditional MapReduce-based join algorithms. • The proposed algorithm scales regardless of the size of the input data.
One of the major obstacles to effective join processing on MapReduce is data skew. Because MapReduce’s basic hash-based partitioning cannot handle skew properly, two alternatives have been proposed: range-based and randomized methods. Both still have drawbacks: the range-based method does not handle join product skew, and the randomized method performs worse than basic hash-based partitioning when the input relations are not skewed. In this paper, we present a new skew handling method called multi-dimensional range partitioning (MDRP). The proposed method overcomes the limitations of traditional algorithms in two ways: 1) it considers the number of output records expected at each machine, which leads to better handling of join product skew, and 2) it samples a small number of input records before the actual join begins, so that an efficient execution plan that accounts for the degree of data skew can be created. As a result, in a scalar skew experiment, the proposed join algorithm is about 6.76 times faster than the range-based algorithm when join product skew exists and about 5.14 times faster than the randomized algorithm when the input relations are not skewed. Moreover, a worst-case analysis shows that the input and output imbalances are less than or equal to 2. The proposed algorithm requires no modification to the original MapReduce environment and is applicable to complex join operations such as theta-joins and multi-way joins. |
|---|---|
| DOI: | 10.1016/j.eswa.2015.12.024 |
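
The summary's central idea, balancing reducers by the number of expected output records rather than by key alone, can be illustrated with a small sketch. This is not the MDRP algorithm from the article: it is a simplified, hypothetical model of an equi-join in which the per-key output is estimated as the product of the key's frequencies in the two relations, any key whose estimated output exceeds one reducer's fair share is split into equal sub-cells, and cells are then assigned greedily to the least-loaded reducer. All function names are assumptions made for illustration, and the key counts here are computed from the full inputs, where a real system would scale them up from a small sample.

```python
import math
from collections import Counter


def estimate_output(r_keys, s_keys):
    """Estimated equi-join output per key: (count in R) * (count in S)."""
    r_cnt, s_cnt = Counter(r_keys), Counter(s_keys)
    return {k: r_cnt[k] * s_cnt[k] for k in r_cnt if k in s_cnt}


def hash_loads(output, n_reducers):
    """Output records per reducer when key k is sent to reducer k % n_reducers."""
    loads = [0] * n_reducers
    for key, out in output.items():
        loads[key % n_reducers] += out
    return loads


def balanced_loads(output, n_reducers):
    """Split heavy keys into sub-cells, then assign cells largest-first
    to the currently least-loaded reducer (hypothetical simplification)."""
    share = max(1, math.ceil(sum(output.values()) / n_reducers))
    cells = []
    for out in output.values():
        parts = math.ceil(out / share)      # 1 for light keys, >1 for heavy keys
        base, extra = divmod(out, parts)    # spread the output evenly over parts
        cells.extend(base + (1 if i < extra else 0) for i in range(parts))
    loads = [0] * n_reducers
    for cell in sorted(cells, reverse=True):
        loads[loads.index(min(loads))] += cell
    return loads


def imbalance(loads):
    """Ratio of the heaviest reducer's load to the average load."""
    return max(loads) / (sum(loads) / len(loads))


# Key 0 is heavy in both relations, so its join product dominates the output.
r_keys = [0] * 100 + [k for k in range(1, 8) for _ in range(10)]
s_keys = list(r_keys)
output = estimate_output(r_keys, s_keys)

print("hash-based imbalance:", round(imbalance(hash_loads(output, 4)), 2))
print("output-aware imbalance:", round(imbalance(balanced_loads(output, 4)), 2))
```

In this toy input, hash partitioning sends the entire heavy key to a single reducer, whereas the output-aware assignment spreads its estimated output over several reducers. Splitting a key this way assumes that one relation's records for that key can be divided among the sub-cells while the other relation's records are replicated to each of them.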