TeraHeap: Reducing Memory Pressure in Managed Big Data Frameworks

Bibliographic Details
Title: TeraHeap: Reducing Memory Pressure in Managed Big Data Frameworks
Authors: Kolokasis, Iacovos G., Evdorou, Giannos, Akram, Shoaib, Kozanitis, Christos, Papagiannis, Anastasios, Zakkak, Foivos S., Pratikakis, Polyvios, Bilas, Angelos
Publisher information: Zenodo
Publication year: 2023
Holdings: Zenodo
Keywords: Java Virtual Machine (JVM), large analytics datasets, serialization, large managed heaps, memory management, garbage collection, memory hierarchy, fast storage devices
Description: Big data analytics frameworks, such as Spark and Giraph, need to process and cache massive amounts of data that do not always fit on the managed heap. Therefore, frameworks temporarily move long-lived objects outside the managed heap (off-heap) on a fast storage device. However, this practice results in (1) high serialization/deserialization (S/D) cost and (2) high memory pressure when off-heap objects are moved back to the heap for processing. In this paper, we propose TeraHeap, a system that eliminates S/D overhead and expensive GC scans for a large portion of the objects in big data frameworks. TeraHeap relies on three concepts. (1) It eliminates S/D cost by extending the managed runtime (JVM) to use a second high-capacity heap (H2) over a fast storage device. (2) It offers a simple hint-based interface, allowing big data analytics frameworks to leverage knowledge about objects to populate H2. (3) It reduces GC cost by fencing the garbage collector from scanning H2 objects while maintaining the illusion of a single managed heap. We implement TeraHeap in OpenJDK and evaluate it with 15 widely used applications in two real-world big data frameworks, Spark and Giraph. Our evaluation shows that for the same DRAM size, TeraHeap improves performance by up to 73% and 28% compared to native Spark and Giraph, respectively. Also, it provides better performance by consuming up to 4.6× and 1.2× less DRAM capacity than native Spark and Giraph, respectively. Finally, it outperforms Panthera, a state-of-the-art garbage collector for hybrid memories, by up to 69%.
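To make the hint-based interface described in the abstract concrete, the following is a minimal Java sketch of how a framework might tag a long-lived cached object for placement in the second heap (H2). All names here (H2HintSketch, hintMoveToH2, CachedPartition) are hypothetical illustrations under that assumption, not the actual TeraHeap API defined in the paper.

    import java.util.ArrayList;
    import java.util.List;

    public class H2HintSketch {
        // Records hinted roots; stands in for what would be a runtime-level intrinsic.
        static final List<Object> hinted = new ArrayList<>();

        // Hypothetical hint entry point: a real managed runtime would treat the
        // object graph rooted at 'root' as a candidate for placement in the
        // second, storage-backed heap (H2) and fence it off from regular GC scans.
        static void hintMoveToH2(Object root) {
            hinted.add(root);
        }

        // A long-lived cached partition: the kind of object a framework such as
        // Spark would otherwise serialize to an off-heap cache.
        static final class CachedPartition {
            final int id;
            final long[] payload;
            CachedPartition(int id, long[] payload) {
                this.id = id;
                this.payload = payload;
            }
        }

        public static void main(String[] args) {
            CachedPartition p = new CachedPartition(42, new long[1024]);
            // The framework, which knows object lifetimes, issues the hint;
            // application code is left unchanged.
            hintMoveToH2(p);
            System.out.println("Hinted roots: " + hinted.size());
        }
    }

In this sketch the hint is only recorded in a list; the point is that the decision of which objects populate H2 is expressed by the framework through a single call, rather than by serializing objects off-heap.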
Publication type: conference object
Language: unknown
Relation: https://zenodo.org/communities/eupex/; https://zenodo.org/communities/eu/; https://zenodo.org/records/8189457; oai:zenodo.org:8189457; https://doi.org/10.1145/3582016.3582045
DOI: 10.1145/3582016.3582045
Availability: https://doi.org/10.1145/3582016.3582045
https://zenodo.org/records/8189457
Rights: Creative Commons Attribution 4.0 International ; cc-by-4.0 ; https://creativecommons.org/licenses/by/4.0/legalcode
Document code: edsbas.216550A6
Database: BASE