SEDGE: Symbolic example data generation for dataflow programs

Bibliographic Details
Published in: 2013 IEEE/ACM 28th International Conference on Automated Software Engineering (ASE), pp. 235-245
Main Authors: Li, Kaituo; Reichenbach, Christoph; Smaragdakis, Yannis; Diao, Yanlei; Csallner, Christoph
Format: Conference Proceeding
Language: English
Published: IEEE, 2013-11-01
Abstract: Exhaustive, automatic testing of dataflow (esp. mapreduce) programs has emerged as an important challenge. Past work demonstrated effective ways to generate small example data sets that exercise operators in the Pig platform, used to generate Hadoop map-reduce programs. Although such prior techniques attempt to cover all cases of operator use, in practice they often fail. Our SEDGE system addresses these completeness problems: for every dataflow operator, we produce data aiming to cover all cases that arise in the dataflow program (e.g., both passing and failing a filter). SEDGE relies on transforming the program into symbolic constraints, and solving the constraints using a symbolic reasoning engine (a powerful SMT solver), while using input data as concrete aids in the solution process. The approach resembles dynamic-symbolic (a.k.a. "concolic") execution in a conventional programming language, adapted to the unique features of the dataflow domain. In third-party benchmarks, SEDGE achieves higher coverage than past techniques for 5 out of 20 PigMix benchmarks and 7 out of 11 SDSS benchmarks (with equal coverage for the rest of the benchmarks). We also show that our targeting of the high-level dataflow language pays off: for complex programs, state-of-the-art dynamic-symbolic execution at the level of the generated map-reduce code (instead of the original dataflow program) requires many more test cases or achieves much lower coverage than our approach.
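The idea of covering both outcomes of a filter can be illustrated with a minimal sketch. It is not SEDGE's implementation: a brute-force search over a small integer domain stands in for the SMT solver, and the filter expression, the field names `x` and `y`, the seed row, and the helpers `find_row` and `examples_for_filter` are all hypothetical. Concrete seed rows are tried first, loosely mirroring the abstract's use of input data as concrete aids.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Optional, Sequence

Row = Dict[str, int]


def find_row(
    constraint: Callable[[Row], bool],
    fields: Sequence[str],
    seeds: Iterable[Row],
    domain: Sequence[int] = range(-5, 6),
) -> Optional[Row]:
    """Return a row satisfying `constraint`, or None if none is found.

    Seed rows are checked first; otherwise a bounded exhaustive search
    plays the role that an SMT solver would play in a real system.
    """
    for row in seeds:
        if constraint(row):
            return row
    for values in product(domain, repeat=len(fields)):
        row = dict(zip(fields, values))
        if constraint(row):
            return row
    return None


def examples_for_filter(
    predicate: Callable[[Row], bool],
    fields: Sequence[str],
    seeds: Sequence[Row],
) -> Dict[str, Optional[Row]]:
    """Produce one row that passes the filter and one row that fails it."""
    return {
        "pass": find_row(predicate, fields, seeds),
        "fail": find_row(lambda r: not predicate(r), fields, seeds),
    }


# Hypothetical Pig-style step: B = FILTER A BY x > 3 AND y == x + 1;
predicate = lambda r: r["x"] > 3 and r["y"] == r["x"] + 1
seeds = [{"x": 1, "y": 2}]  # assumed sample input row
examples = examples_for_filter(predicate, ["x", "y"], seeds)
```

Here the seed row already fails the filter, so it is reused as the failing example, while the passing example has to be found by search. A real solver would replace the bounded enumeration and handle richer types and operators.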
Author Affiliations:
Li, Kaituo: Comput. Sci. Dept., Univ. of Massachusetts, Amherst, MA, USA
Reichenbach, Christoph: Inst. of Inf., Goethe Univ. Frankfurt, Frankfurt am Main, Germany
Smaragdakis, Yannis: Dept. of Inf., Univ. of Athens, Athens, Greece
Diao, Yanlei: Comput. Sci. Dept., Univ. of Massachusetts, Amherst, MA, USA
Csallner, Christoph: Comput. Sci. & Eng., Univ. of Texas at Arlington, Arlington, TX, USA
DOI: 10.1109/ASE.2013.6693083
Discipline: Computer Science
EISBN: 9781479902156; 1479902152
Subjects: Benchmark testing; Cognition; Concrete; Data processing; Educational institutions; Extraterrestrial measurements; Programming
URI: https://ieeexplore.ieee.org/document/6693083