PAC: A monitoring framework for performance analysis of compression algorithms in Spark
Saved in:

| Published in: | Future Generation Computer Systems, Vol. 157, pp. 237–249 |
|---|---|
| Main Authors: | , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier B.V., 01.08.2024 |
| Subjects: | |
| ISSN: | 0167-739X, 1872-7115 |
| Online Access: | Full text |
Abstract:

In Spark, a massive amount of intermediate data inevitably leads to excessive I/O overhead. To mitigate this issue, Spark incorporates four compression algorithms to reduce data sizes for better performance. However, compression and decompression constitute only a portion of the overall logical flow of Spark applications, which implies considerable interaction between the compression algorithms and the applications with respect to performance. Consequently, identifying the factors that significantly impact the performance of compression algorithms in Spark, and in turn determining the actual performance benefits these algorithms provide to Spark applications, remains a significant challenge.

To address this challenge, this paper presents PAC, a monitoring framework for in-depth and systematic performance analysis of compression algorithms in Spark. As the first framework of its kind, PAC is built on top of the Spark core and coordinates multiple monitors to collect various types of compressor performance metrics, which PAC's data transformer correlates and integrates into structured tuples. This makes it easier to diagnose the factors that significantly influence the performance of compression algorithms in Spark. Using PAC, our experiments reveal new determinants beyond the traditional ones: the input/output data sizes and types of compression/decompression invocations, the CPU consumed when compressing massive amounts of data, and hardware utilization. The experiments further show that ZSTD is more susceptible to performance issues when compressing and decompressing small pieces of data, even when the overall input and output data are huge; in terms of performance, LZ4 serves as a viable alternative to ZSTD. These findings help researchers and developers make more informed decisions when configuring and tuning Spark execution environments, and they support the continued optimization of compression algorithms for Spark.
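For readers who want to act on the LZ4-versus-ZSTD finding, Spark selects the codec for its internal I/O (shuffle, spills, broadcasts) through the `spark.io.compression.codec` setting. The snippet below is a minimal sketch of switching a job to LZ4; the keys `spark.io.compression.codec` and `spark.io.compression.lz4.blockSize` are standard Spark configuration, not part of PAC, and the values shown are Spark's defaults rather than tuned recommendations.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Select LZ4 for Spark's internal I/O (shuffle, spills, broadcasts).
// Valid built-in codec names include "lz4", "lzf", "snappy", and "zstd".
val conf = new SparkConf()
  .set("spark.io.compression.codec", "lz4")
  // LZ4 block size: larger blocks can improve the compression ratio
  // at the cost of memory (32k is Spark's default).
  .set("spark.io.compression.lz4.blockSize", "32k")

val spark = SparkSession.builder()
  .appName("codec-tuning-example")
  .config(conf)
  .getOrCreate()
```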
Highlights:
•A monitoring framework, named PAC, is proposed for performance analysis of compression algorithms in Spark.
•An approach is proposed to integrate Zlib into Spark, yielding up to a 27% performance improvement for Spark applications using Zlib (see the codec sketch below).
•PAC is used to analyze the root causes of the performance differences among compression algorithms in Spark.
•The results offer significant potential for optimizing compression algorithms in Spark and for configuring and tuning Spark execution environments.
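The paper's actual Zlib integration is not reproduced in this record. As an illustration of the extension point involved, the sketch below implements Spark's `org.apache.spark.io.CompressionCodec` developer API with `java.util.zip` streams. The class name `ZlibCompressionCodec` and the configuration key `spark.io.compression.zlib.level` are assumptions for illustration, not the authors' implementation; Spark loads a custom codec when its fully qualified class name is passed to `spark.io.compression.codec`.

```scala
import java.io.{InputStream, OutputStream}
import java.util.zip.{Deflater, DeflaterOutputStream, InflaterInputStream}

import org.apache.spark.SparkConf
import org.apache.spark.io.CompressionCodec

// Hypothetical zlib codec for Spark, sketched against the @DeveloperApi
// CompressionCodec trait. Spark instantiates codecs reflectively through
// a one-argument SparkConf constructor.
class ZlibCompressionCodec(conf: SparkConf) extends CompressionCodec {

  // Compression level 1-9, read from a made-up key for this sketch;
  // falls back to zlib's library default.
  private val level =
    conf.getInt("spark.io.compression.zlib.level", Deflater.DEFAULT_COMPRESSION)

  override def compressedOutputStream(s: OutputStream): OutputStream =
    new DeflaterOutputStream(s, new Deflater(level))

  override def compressedInputStream(s: InputStream): InputStream =
    new InflaterInputStream(s)
}

// Usage: route Spark's internal I/O through the custom codec.
// conf.set("spark.io.compression.codec", classOf[ZlibCompressionCodec].getName)
```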
| DOI: | 10.1016/j.future.2024.02.009 |
|---|---|