PAC: A monitoring framework for performance analysis of compression algorithms in Spark

Full description

Bibliographic details
Published in: Future generation computer systems, vol. 157, pp. 237-249
Main authors: Zhu, Changpeng; Han, Bo; Li, Gang
Format: Journal Article
Language: English
Published: Elsevier B.V., 1 August 2024
ISSN: 0167-739X, 1872-7115
Online access: Full text
Abstract In Spark, a massive amount of intermediate data inevitably leads to excessive I/O overhead. To mitigate this issue, Spark incorporates four compression algorithms to reduce the size of the data for better performance. However, compression and decompression constitute only a portion of the overall logical flows of Spark applications. This indicates a potentially considerable interaction between compression algorithms and Spark applications regarding performance. Consequently, identifying the factors that significantly impact the performance of compression algorithms in Spark, and subsequently determining the actual performance benefits these algorithms provide to Spark applications, remains a significant challenge. To address this challenge, this paper presents a monitoring framework, named PAC, for conducting in-depth and systematic performance analysis of compression algorithms in Spark. As the first such monitoring framework, PAC is built on top of Spark core and collaborates with multiple monitors to collect various types of performance metrics of compressors; its data transformer then correlates and integrates these metrics into structured tuples. This makes it easier to diagnose the factors that have a significant influence on the performance of compression algorithms in Spark. Using PAC, our experiments reveal that, besides traditional determinants, new determinants include the input/output data sizes and types of compression/decompression invocations, CPU consumption for compressing a massive amount of data, and hardware utilization. Moreover, these experiments demonstrate that ZSTD is more susceptible to performance issues when compressing and decompressing small data, even when the overall input and output data are huge. In terms of performance, LZ4 serves as a viable alternative to ZSTD. These findings not only help researchers and developers make more informed decisions when configuring and tuning Spark execution environments but also support the continued optimization of compression algorithms for Spark.
• A monitoring framework, named PAC, is proposed for performance analysis of compression algorithms in Spark.
• An approach is proposed to integrate Zlib into Spark, resulting in up to a 27% performance enhancement for Spark applications using Zlib.
• Root causes of the performance differences among compression algorithms in Spark are analyzed by PAC.
• The research results offer significant potential for optimizing compression algorithms in Spark and for configuring and tuning Spark execution environments.
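The abstract's point that the input and output sizes of individual compression invocations matter, even when the total data volume is large, can be illustrated with a small sketch. This is not PAC and does not use Spark or ZSTD: it assumes Python's standard `zlib` module as a stand-in compressor, a synthetic payload, and a hypothetical `Invocation` tuple whose fields are chosen here only for illustration.

```python
import time
import zlib
from typing import List, NamedTuple


class Invocation(NamedTuple):
    """One compression call as a flat tuple (hypothetical schema, not PAC's)."""
    codec: str
    input_bytes: int
    output_bytes: int
    elapsed_ns: int


def timed_compress(data: bytes, codec: str = "zlib") -> Invocation:
    """Compress one buffer with zlib and record its sizes and elapsed time."""
    start = time.perf_counter_ns()
    out = zlib.compress(data)
    elapsed = time.perf_counter_ns() - start
    return Invocation(codec, len(data), len(out), elapsed)


def compress_in_chunks(data: bytes, chunk_size: int) -> List[Invocation]:
    """Compress data as independent chunks, one invocation per chunk."""
    return [
        timed_compress(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def summarize(invocations: List[Invocation]) -> dict:
    """Aggregate per-invocation tuples into totals."""
    return {
        "calls": len(invocations),
        "input_bytes": sum(r.input_bytes for r in invocations),
        "output_bytes": sum(r.output_bytes for r in invocations),
        "elapsed_ns": sum(r.elapsed_ns for r in invocations),
    }


# Synthetic, repetitive payload standing in for intermediate data (assumption).
payload = b"".join(b"key-%06d,value-%06d;" % (i, i % 97) for i in range(20000))

# Same total input, compressed once as a whole and then as 64-byte pieces.
one_call = summarize(compress_in_chunks(payload, len(payload)))
many_calls = summarize(compress_in_chunks(payload, 64))

if __name__ == "__main__":
    print("one large call  :", one_call)
    print("many small calls:", many_calls)
```

Both runs see the same total input, but the chunked run emits more output bytes in total, because each independent call carries its own stream header and checksum and cannot reuse matches from earlier chunks. Elapsed times depend on the machine and are recorded here only to show the shape of a per-invocation record.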
Author Li, Gang
Zhu, Changpeng
Han, Bo
Author_xml – sequence: 1
  givenname: Changpeng
  orcidid: 0000-0002-7272-7036
  surname: Zhu
  fullname: Zhu, Changpeng
  email: is99zcp@hotmail.com
  organization: Department of Data Science and Big Data, Chongqing University of Technology, China
– sequence: 2
  givenname: Bo
  orcidid: 0000-0003-2591-5859
  surname: Han
  fullname: Han, Bo
  email: bohan@xjtu.edu.cn
  organization: School of Cyber Science and Engineering, Xi’an Jiaotong University, China
– sequence: 3
  givenname: Gang
  surname: Li
  fullname: Li, Gang
  email: ligang@cqut.edu.cn
  organization: Department of Software Engineering, Chongqing University of Technology, China
CitedBy_id crossref_primary_10_1016_j_comnet_2025_111131
crossref_primary_10_1007_s10586_025_05509_4
crossref_primary_10_1016_j_hcc_2025_100341
Cites_doi 10.1007/s11227-022-04381-y
10.1016/j.future.2016.08.025
10.1109/MNET.109.2100384
10.1016/j.future.2018.12.002
10.1145/2791120
10.1109/ICDCS.2017.228
10.1109/TSC.2016.2611578
10.3390/s22134756
10.1109/TPDS.2020.2992073
10.1002/cpe.5523
10.1007/s11704-021-0118-1
ContentType Journal Article
Copyright 2024
Copyright_xml – notice: 2024
DBID AAYXX
CITATION
DOI 10.1016/j.future.2024.02.009
DatabaseName CrossRef
DatabaseTitle CrossRef
DatabaseTitleList
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISSN 1872-7115
EndPage 249
ExternalDocumentID 10_1016_j_future_2024_02_009
S0167739X24000554
GrantInformation_xml – fundername: National Natural Science Foundation of China
  funderid: http://dx.doi.org/10.13039/501100001809
ISICitedReferencesCount 2
ISSN 0167-739X
IngestDate Sat Nov 29 03:48:12 EST 2025
Tue Nov 18 21:51:05 EST 2025
Sat May 25 15:40:24 EDT 2024
IsPeerReviewed true
IsScholarly true
Keywords Log analysis
Performance comparison
Root-cause analysis
Spark
Compression algorithms
Language English
LinkModel OpenURL
ORCID 0000-0003-2591-5859
0000-0002-7272-7036
PageCount 13
ParticipantIDs crossref_primary_10_1016_j_future_2024_02_009
crossref_citationtrail_10_1016_j_future_2024_02_009
elsevier_sciencedirect_doi_10_1016_j_future_2024_02_009
PublicationCentury 2000
PublicationDate August 2024
2024-08-00
PublicationDateYYYYMMDD 2024-08-01
PublicationDate_xml – month: 08
  year: 2024
  text: August 2024
PublicationDecade 2020
PublicationTitle Future generation computer systems
PublicationYear 2024
Publisher Elsevier B.V
Publisher_xml – name: Elsevier B.V
SSID ssj0001731
Score 2.4181325
SourceID crossref
elsevier
SourceType Enrichment Source
Index Database
Publisher
StartPage 237
Title PAC: A monitoring framework for performance analysis of compression algorithms in Spark
URI https://dx.doi.org/10.1016/j.future.2024.02.009
Volume 157