PAC: A monitoring framework for performance analysis of compression algorithms in Spark
In Spark, a massive amount of intermediate data inevitably leads to excessive I/O overhead. To mitigate this issue, Spark incorporates four compression algorithms to reduce the size of the data for better performance. However, compression and decompression only constitute a portion of the overall logic...
| Published in: | Future generation computer systems, Vol. 157, pp. 237-249 |
|---|---|
| Main authors: | Zhu, Changpeng; Han, Bo; Li, Gang |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier B.V, 1 August 2024 |
| Subjects: | Compression algorithms; Log analysis; Performance comparison; Root-cause analysis; Spark |
| ISSN: | 0167-739X, 1872-7115 |
| Online access: | Full text |
| Abstract | In Spark, a massive amount of intermediate data inevitably leads to excessive I/O overhead. To mitigate this issue, Spark incorporates four compression algorithms to reduce the size of the data for better performance. However, compression and decompression constitute only a portion of the overall logical flow of a Spark application, which suggests that compression algorithms and Spark applications may interact considerably with respect to performance. Consequently, identifying the factors that significantly affect the performance of compression algorithms in Spark, and then determining the actual performance benefits these algorithms bring to Spark applications, remains a significant challenge.
To address this challenge, this paper presents a monitoring framework, named PAC, for in-depth and systematic performance analysis of compression algorithms in Spark. As the first monitoring framework of this kind, PAC is built on top of Spark core and collaborates with multiple monitors to collect various types of performance metrics of compressors; its data transformer then correlates and integrates these metrics into structured tuples. This makes it easier to diagnose the factors that have a significant influence on the performance of compression algorithms in Spark. Using PAC, our experiments reveal that, besides the traditional determinants, new determinants include the input/output data sizes and types of compression/decompression invocations, the CPU consumption of compressing a massive amount of data, and hardware utilization. Moreover, these experiments demonstrate that ZSTD is more susceptible to performance issues when compressing and decompressing small data, even when the overall input and output data are huge. In terms of performance, LZ4 serves as a viable alternative to ZSTD. These findings not only help researchers and developers make more informed decisions when configuring and tuning Spark execution environments but also support the continued optimization of compression algorithms for Spark.
• A monitoring framework, named PAC, is proposed for performance analysis of compression algorithms in Spark.
• An approach is proposed to integrate Zlib into Spark, resulting in up to a 27% performance enhancement for Spark applications using Zlib.
• Root causes of the performance differences among compression algorithms in Spark are analyzed by PAC.
• The research results offer significant potential for optimizing compression algorithms in Spark and for configuring and tuning Spark execution environments. |
|---|---|
| Author | Li, Gang; Zhu, Changpeng; Han, Bo |
| Author_xml | – sequence: 1 givenname: Changpeng orcidid: 0000-0002-7272-7036 surname: Zhu fullname: Zhu, Changpeng email: is99zcp@hotmail.com organization: Department of Data Science and Big Data, Chongqing University of Technology, China – sequence: 2 givenname: Bo orcidid: 0000-0003-2591-5859 surname: Han fullname: Han, Bo email: bohan@xjtu.edu.cn organization: School of Cyber Science and Engineering, Xi’an Jiaotong University, China – sequence: 3 givenname: Gang surname: Li fullname: Li, Gang email: ligang@cqut.edu.cn organization: Department of Software Engineering, Chongqing University of Technology, China |
| CitedBy_id | crossref_primary_10_1016_j_comnet_2025_111131 crossref_primary_10_1007_s10586_025_05509_4 crossref_primary_10_1016_j_hcc_2025_100341 |
| Cites_doi | 10.1007/s11227-022-04381-y 10.1016/j.future.2016.08.025 10.1109/MNET.109.2100384 10.1016/j.future.2018.12.002 10.1145/2791120 10.1109/ICDCS.2017.228 10.1109/TSC.2016.2611578 10.3390/s22134756 10.1109/TPDS.2020.2992073 10.1002/cpe.5523 10.1007/s11704-021-0118-1 |
| ContentType | Journal Article |
| Copyright | 2024 |
| Copyright_xml | – notice: 2024 |
| DBID | AAYXX CITATION |
| DOI | 10.1016/j.future.2024.02.009 |
| DatabaseName | CrossRef |
| DatabaseTitle | CrossRef |
| DatabaseTitleList | |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Computer Science |
| EISSN | 1872-7115 |
| EndPage | 249 |
| ExternalDocumentID | 10_1016_j_future_2024_02_009 S0167739X24000554 |
| GrantInformation_xml | – fundername: National Natural Science Foundation of China funderid: http://dx.doi.org/10.13039/501100001809 |
| ID | FETCH-LOGICAL-c306t-23e324d57a671455d4579d4d0a93827d77d8cf01ebf53a9cdbf66180c234fd6f3 |
| ISICitedReferencesCount | 2 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001223701400001&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| ISSN | 0167-739X |
| IngestDate | Sat Nov 29 03:48:12 EST 2025 Tue Nov 18 21:51:05 EST 2025 Sat May 25 15:40:24 EDT 2024 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Keywords | Log analysis; Performance comparison; Root-cause analysis; Spark; Compression algorithms |
| Language | English |
| LinkModel | OpenURL |
| MergedId | FETCHMERGED-LOGICAL-c306t-23e324d57a671455d4579d4d0a93827d77d8cf01ebf53a9cdbf66180c234fd6f3 |
| ORCID | 0000-0003-2591-5859 0000-0002-7272-7036 |
| PageCount | 13 |
| ParticipantIDs | crossref_primary_10_1016_j_future_2024_02_009 crossref_citationtrail_10_1016_j_future_2024_02_009 elsevier_sciencedirect_doi_10_1016_j_future_2024_02_009 |
| PublicationCentury | 2000 |
| PublicationDate | August 2024 2024-08-00 |
| PublicationDateYYYYMMDD | 2024-08-01 |
| PublicationDate_xml | – month: 08 year: 2024 text: August 2024 |
| PublicationDecade | 2020 |
| PublicationTitle | Future generation computer systems |
| PublicationYear | 2024 |
| Publisher | Elsevier B.V |
| Publisher_xml | – name: Elsevier B.V |
| SSID | ssj0001731 |
| Score | 2.4181325 |
| Snippet | In Spark, a massive amount of intermediate data inevitably leads to excessive I/O overhead. To mitigate this issue, Spark incorporates four compression algorithms... |
| SourceID | crossref elsevier |
| SourceType | Enrichment Source Index Database Publisher |
| StartPage | 237 |
| SubjectTerms | Compression algorithms; Log analysis; Performance comparison; Root-cause analysis; Spark |
| Title | PAC: A monitoring framework for performance analysis of compression algorithms in Spark |
| URI | https://dx.doi.org/10.1016/j.future.2024.02.009 |
| Volume | 157 |
| WOSCitedRecordID | wos001223701400001&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| journalDatabaseRights | – providerCode: PRVESC databaseName: Elsevier SD Freedom Collection Journals 2021 customDbUrl: eissn: 1872-7115 dateEnd: 99991231 omitProxy: false ssIdentifier: ssj0001731 issn: 0167-739X databaseCode: AIEXJ dateStart: 19950201 isFulltext: true titleUrlDefault: https://www.sciencedirect.com providerName: Elsevier |
| linkProvider | Elsevier |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=PAC%3A+A+monitoring+framework+for+performance+analysis+of+compression+algorithms+in+Spark&rft.jtitle=Future+generation+computer+systems&rft.au=Zhu%2C+Changpeng&rft.au=Han%2C+Bo&rft.au=Li%2C+Gang&rft.date=2024-08-01&rft.issn=0167-739X&rft.volume=157&rft.spage=237&rft.epage=249&rft_id=info:doi/10.1016%2Fj.future.2024.02.009&rft.externalDBID=n%2Fa&rft.externalDocID=10_1016_j_future_2024_02_009 |
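
The abstract describes collecting per-invocation metrics for compressors and integrating them into structured tuples. The following is a minimal sketch of that idea only, not PAC itself and not Spark code: it uses `zlib` from the Python standard library as a stand-in compressor, and the names `CompressionRecord`, `timed_call`, `profile_chunks`, and `summarize` are hypothetical, introduced here purely for illustration.

```python
import time
import zlib
from typing import NamedTuple


class CompressionRecord(NamedTuple):
    """One structured tuple per compression or decompression invocation."""
    codec: str         # name of the compressor (always "zlib" in this sketch)
    operation: str     # "compress" or "decompress"
    input_bytes: int   # size of the data passed to the call
    output_bytes: int  # size of the data the call produced
    elapsed_ns: int    # wall-clock duration of the call in nanoseconds


def timed_call(codec, operation, func, data):
    """Run func(data) and return its result together with a record."""
    start = time.perf_counter_ns()
    result = func(data)
    elapsed = time.perf_counter_ns() - start
    record = CompressionRecord(codec, operation, len(data), len(result), elapsed)
    return result, record


def profile_chunks(payload, chunk_size):
    """Compress and decompress payload chunk by chunk, one record per call."""
    records = []
    for offset in range(0, len(payload), chunk_size):
        chunk = payload[offset:offset + chunk_size]
        packed, rec = timed_call("zlib", "compress", zlib.compress, chunk)
        records.append(rec)
        unpacked, rec = timed_call("zlib", "decompress", zlib.decompress, packed)
        records.append(rec)
        assert unpacked == chunk  # the round trip must be lossless
    return records


def summarize(records):
    """Aggregate records: call count, total input bytes, total elapsed time."""
    return {
        "invocations": len(records),
        "input_bytes": sum(r.input_bytes for r in records),
        "elapsed_ns": sum(r.elapsed_ns for r in records),
    }


if __name__ == "__main__":
    payload = b"spark shuffle block " * 50_000  # about 1 MB of repetitive data
    # Same total payload, processed as many small calls or a few large ones.
    for size in (256, 64 * 1024):
        print(size, summarize(profile_chunks(payload, size)))
```

Running the script with the two chunk sizes processes the same total payload through many small invocations or a few large ones, which is the kind of input-size and invocation-count comparison the abstract names as a determinant. Timings from this sketch depend on the machine and say nothing about ZSTD, LZ4, or Spark.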