Automated Performance and Correctness Debugging for Big Data Analytics
| Title: | Automated Performance and Correctness Debugging for Big Data Analytics |
|---|---|
| Authors: | Kim, Miryung; Teoh, Jia |
| Publisher Information: | eScholarship, University of California, 2022-01-01 |
| Document Type: | Electronic Resource |
| Abstract: | The constantly increasing volume of data collected in every aspect of our daily lives has necessitated the development of more powerful and efficient analysis tools. In particular, data-intensive scalable computing (DISC) systems such as Google’s MapReduce [36], Apache Hadoop [4], and Apache Spark [5] have become valuable tools for consuming and analyzing large volumes of data. At the same time, these systems provide valuable programming abstractions and libraries which enable adoption by users from a wide variety of backgrounds such as business analytics and data science. However, the widespread adoption of DISC systems and their underlying complexity have also highlighted a gap between developers’ abilities to write applications and their abilities to understand the behavior of their applications. By merging distributed systems debugging techniques with software engineering ideas, our hypothesis is that we can design accurate yet scalable approaches for debugging and testing of big data analytics’ performance and correctness. To design such approaches, we first investigate how we can combine data provenance with latency propagation techniques in order to debug computation skew (abnormally high computation costs for a small subset of input data) by identifying expensive input records. Next, we investigate how we can extend taint analysis techniques with influence-based provenance for many-to-one dependencies to enhance root cause analysis and improve the precision of identifying fault-inducing input records. Finally, in order to replicate performance problems based on described symptoms, we investigate how we can redesign fuzz testing by targeting individual program components such as user-defined functions for focused, modular fuzzing, defining new guidance metrics for performance symptoms, and adding skew-inspired input mutations and mutation operation selector strategies. For the first hypothesis, we introduce PERFDEBUG, a post-mortem performance debugging tool |
| Index Terms: | Computer science, Big Data, Computation Skew, Data-Intensive Scalable Computing, Debugging, publication |
| URL: | |
| Availability: | Open access content (public) |
| Note: | application/pdf; English |
| Other Numbers: | CDLER oai:escholarship.org:ark:/13030/qt2xp367hd qt2xp367hd 1367490541 |
| Contributing Source: | UC MASS DIGITIZATION From OAIster®, provided by the OCLC Cooperative. |
| Accession Number: | edsoai.on1367490541 |
| Database: | OAIster |
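The abstract's first line of work combines data provenance with latency propagation to locate the expensive input records behind computation skew. As a rough, illustrative approximation only (not the PerfDebug tool itself), the Scala/Spark sketch below times a hypothetical user-defined function on each record and reports the slowest inputs; the object name, the `expensiveUdf` function, the input path, and the per-record timing approach are all assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch only: a naive way to surface computation skew by timing a
// user-defined function per input record and reporting the slowest records.
// This is NOT the PerfDebug tool described in the abstract; `expensiveUdf`,
// the input path, and the timing strategy are assumptions for illustration.
object ComputationSkewSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("computation-skew-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical user-defined function whose cost varies with the input record.
    def expensiveUdf(record: String): Int =
      record.split(" ").map(_.length).sum

    val input = sc.textFile(args.headOption.getOrElse("input.txt"))

    // Measure per-record latency of the UDF, approximating the
    // "expensive input records" that the abstract refers to.
    val timed = input.map { record =>
      val start = System.nanoTime()
      val result = expensiveUdf(record)
      val elapsedMicros = (System.nanoTime() - start) / 1000
      (elapsedMicros, record, result)
    }

    // Report the ten slowest input records.
    timed
      .top(10)(Ordering.by[(Long, String, Int), Long](_._1))
      .foreach { case (micros, record, _) =>
        println(s"$micros us\t${record.take(80)}")
      }

    spark.stop()
  }
}
```

Timing each record inside a single map stage is a crude stand-in for the provenance-based latency propagation the dissertation proposes, but it conveys the underlying idea of attributing compute cost to individual input records.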