Automated Performance and Correctness Debugging for Big Data Analytics

Bibliographic Details
Title: Automated Performance and Correctness Debugging for Big Data Analytics
Authors: Kim, Miryung; Teoh, Jia
Publisher Information: eScholarship, University of California 2022-01-01
Document Type: Electronic Resource
Abstract: The constantly increasing volume of data collected in every aspect of our daily lives has necessitated the development of more powerful and efficient analysis tools. In particular, data-intensive scalable computing (DISC) systems such as Google’s MapReduce [36], Apache Hadoop [4], and Apache Spark [5] have become valuable tools for consuming and analyzing large volumes of data. At the same time, these systems provide programming abstractions and libraries that enable adoption by users from a wide variety of backgrounds, such as business analytics and data science. However, the widespread adoption of DISC systems and their underlying complexity have also highlighted a gap between developers’ ability to write applications and their ability to understand the behavior of those applications. Our hypothesis is that, by merging distributed systems debugging techniques with software engineering ideas, we can design accurate yet scalable approaches for debugging and testing the performance and correctness of big data analytics. To design such approaches, we first investigate how we can combine data provenance with latency propagation techniques to debug computation skew (abnormally high computation costs for a small subset of input data) by identifying expensive input records. Next, we investigate how we can extend taint analysis techniques with influence-based provenance for many-to-one dependencies to enhance root cause analysis and improve the precision of identifying fault-inducing input records. Finally, in order to replicate performance problems from described symptoms, we investigate how we can redesign fuzz testing by targeting individual program components such as user-defined functions for focused, modular fuzzing, defining new guidance metrics for performance symptoms, and adding skew-inspired input mutations and mutation operation selector strategies. For the first hypothesis, we introduce PERFDEBUG, a post-mortem performance debugging tool
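To make the notion of computation skew concrete, the following is a minimal, hypothetical Apache Spark sketch (in Scala) of the symptom the abstract describes: a handful of input records trigger a far more expensive code path inside a user-defined function, so a small subset of the data dominates total runtime even though data volumes per partition are balanced. The object name, the dataset, and the threshold are illustrative assumptions only; this is not the PerfDebug tool or any code from the dissertation.

    // Illustrative sketch of computation skew: most records are cheap to process,
    // but a small subset triggers an expensive inner loop inside the map UDF.
    import org.apache.spark.sql.SparkSession

    object ComputationSkewExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("computation-skew-demo")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        // One million synthetic records; only multiples of 100000 are "expensive".
        val records = sc.parallelize(1 to 1000000)
        val result = records.map { x =>
          if (x % 100000 == 0) {
            // Expensive path: cost depends on the record's value, not on how much
            // data the partition holds, so size-based skew metrics would miss it.
            (1 to 50000).foldLeft(0L)((acc, i) => acc + (x.toLong * i) % 7)
          } else {
            x.toLong // Cheap path taken by the vast majority of records.
          }
        }.reduce(_ + _)

        println(s"result = $result")
        spark.stop()
      }
    }

Because the cost here depends on record values rather than partition sizes, conventional data-skew diagnostics would not flag it; this is why the abstract pairs data provenance with latency propagation to trace the excess latency back to the responsible input records.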
Index Terms: Computer science, Big Data, Computation Skew, Data-Intensive Scalable Computing, Debugging
URL: https://escholarship.org/uc/item/2xp367hd
https://escholarship.org/content/qt2xp367hd/qt2xp367hd.pdf
Availability: Open access content
Note: application/pdf
English
Other Numbers: CDLER oai:escholarship.org:ark:/13030/qt2xp367hd
qt2xp367hd
1367490541
Contributing Source: UC MASS DIGITIZATION
From OAIster®, provided by the OCLC Cooperative.
Accession Number: edsoai.on1367490541
Database: OAIster