Apache Nemo: A Framework for Optimizing Distributed Data Processing.

Uloženo v:
Podrobná bibliografie
Název: Apache Nemo: A Framework for Optimizing Distributed Data Processing.
Autoři: WON WOOK SONG, YOUNGSEOK YANG, JEONGYOON EO, JANGHO SEO, JOO YEON KIM, SANHA LEE, GYEWON LEE, TAEGEON UM, HAEYOON CHO, BYUNG-GON CHUN
Zdroj: ACM Transactions on Computer Systems; Oct2021, Vol. 38 Issue 4, p1-31, 31p
Témata: DISTRIBUTED computing, COMPILERS (Computer programs), ELECTRONIC data processing, DATA transmission systems
Abstrakt: Optimizing scheduling and communication of distributed data processing for resource and data characteristics is crucial for achieving high performance. Existing approaches to such optimizations largely fall into two categories. First, distributed runtimes provide low-level policy interfaces to apply the optimizations, but do not ensure the maintenance of correct application semantics and thus often require significant effort to use. Second, policy interfaces that extend a high-level application programming model ensure correctness, but do not provide sufficient fine control. We describe Apache Nemo, an optimization framework for distributed dataflow processing that provides fine control for high performance and also ensures correctness for ease of use. We combine several techniques to achieve this, including an intermediate representation of dataflow, compiler optimization passes, and runtime extensions. Our evaluation results show that Nemo enables composable and reusable optimizations that bring performance improvements on par with existing specialized runtimes tailored for a specific deployment scenario. Apache Nemo is open-sourced at https://nemo.apache.org as an Apache incubator project. [ABSTRACT FROM AUTHOR]
Copyright of ACM Transactions on Computer Systems is the property of Association for Computing Machinery and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Databáze: Complementary Index
Popis
Abstrakt:Optimizing scheduling and communication of distributed data processing for resource and data characteristics is crucial for achieving high performance. Existing approaches to such optimizations largely fall into two categories. First, distributed runtimes provide low-level policy interfaces to apply the optimizations, but do not ensure the maintenance of correct application semantics and thus often require significant effort to use. Second, policy interfaces that extend a high-level application programming model ensure correctness, but do not provide sufficient fine control. We describe Apache Nemo, an optimization framework for distributed dataflow processing that provides fine control for high performance and also ensures correctness for ease of use. We combine several techniques to achieve this, including an intermediate representation of dataflow, compiler optimization passes, and runtime extensions. Our evaluation results show that Nemo enables composable and reusable optimizations that bring performance improvements on par with existing specialized runtimes tailored for a specific deployment scenario. Apache Nemo is open-sourced at https://nemo.apache.org as an Apache incubator project. [ABSTRACT FROM AUTHOR]
ISSN:07342071
DOI:10.1145/3468144