Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting

Bibliographic Details
Published in: IEEE/ACM International Conference on Automated Software Engineering: [proceedings], pp. 14-26
Main Authors: Li, Tsz-On, Zong, Wenxi, Wang, Yibo, Tian, Haoye, Wang, Ying, Cheung, Shing-Chi, Kramer, Jeff
Format: Conference Proceeding
Language: English
Published: IEEE 11.09.2023
Subjects:
ISSN: 2643-1572
Abstract: Automated detection of software failures is an important but challenging software engineering task. It involves finding in a vast search space the failure-inducing test cases that contain an input triggering the software fault and an oracle asserting the incorrect execution. We are motivated to study how far this outstanding challenge can be solved by recent advances in large language models (LLMs) such as ChatGPT. However, our study reveals that ChatGPT has a relatively low success rate (28.8%) in finding correct failure-inducing test cases for buggy programs. A possible conjecture is that finding failure-inducing test cases requires analyzing the subtle differences (nuances) between the tokens of a program's correct version and those of its buggy version. When these two versions have similar sets of tokens and attentions, ChatGPT is weak in distinguishing their differences. We find that ChatGPT can successfully generate failure-inducing test cases when it is guided to focus on the nuances. Our solution is inspired by an interesting observation that ChatGPT could infer the intended functionality of buggy code if it is similar to the correct version. Driven by the inspiration, we develop a novel technique, called Differential Prompting, to effectively find failure-inducing test cases with the help of the compilable code synthesized by the inferred intention. Prompts are constructed based on the nuances between the given version and the synthesized code. We evaluate Differential Prompting on Quixbugs (a popular benchmark of buggy programs) and recent programs published at Codeforces (a popular programming contest portal, which is also an official benchmark of ChatGPT). We compare Differential Prompting with two baselines constructed using conventional ChatGPT prompting and Pynguin (the state-of-the-art unit test generation tool for Python programs).
Our evaluation results show that for programs of Quixbugs, Differential Prompting can achieve a success rate of 75.0% in finding failure-inducing test cases, outperforming the best baseline by 2.6X. For programs of Codeforces, Differential Prompting's success rate is 66.7%, outperforming the best baseline by 4.0X.
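As the abstract describes, Differential Prompting first asks ChatGPT to infer the intended functionality of the buggy program, synthesizes a compilable reference version from that intention, and then treats any input on which the two versions disagree as failure-inducing, with the reference output serving as the oracle. A minimal sketch of that final differential step (the buggy/reference pair and candidate inputs below are illustrative stand-ins, not from the paper; in the technique itself the reference version and the nuance-focused test inputs would come from ChatGPT):

```python
# Sketch of the differential step: a test input is failure-inducing when
# the buggy version and the version synthesized from the inferred
# intention disagree; the synthesized output acts as the test oracle.
# Both implementations are illustrative examples, not taken from the paper.

def buggy_max_sublist_sum(nums):
    # Intended: maximum sum of any contiguous sublist (empty sublist -> 0).
    # Bug: never resets the running sum when it drops below zero.
    best = run = 0
    for x in nums:
        run += x
        best = max(best, run)
    return best

def synthesized_max_sublist_sum(nums):
    # Stand-in for the reference version synthesized from the inferred
    # intention (here, Kadane's algorithm).
    best = run = 0
    for x in nums:
        run = max(0, run + x)
        best = max(best, run)
    return best

def find_failure_inducing_test(candidate_inputs):
    """Return (input, expected_output) for the first input on which the
    two versions diverge, or None if they always agree."""
    for nums in candidate_inputs:
        expected = synthesized_max_sublist_sum(nums)
        if buggy_max_sublist_sum(nums) != expected:
            return nums, expected
    return None

# Candidate inputs are listed by hand here; in Differential Prompting
# they would be generated by prompts focused on the nuances between
# the two versions.
print(find_failure_inducing_test([[1, 2, 3], [-1, -2], [-2, 5, 1]]))
# -> ([-2, 5, 1], 6)
```

The key design point the abstract emphasizes is that the prompts themselves are built from the token-level nuances between the given buggy version and the synthesized code, which is what lifts ChatGPT's success rate from 28.8% to 75.0% on Quixbugs.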
Author_xml – sequence: 1
  givenname: Tsz-On
  surname: Li
  fullname: Li, Tsz-On
  email: toli@cse.ust.hk
  organization: The Hong Kong University of Science and Technology,Hong Kong,China
– sequence: 2
  givenname: Wenxi
  surname: Zong
  fullname: Zong, Wenxi
  email: iamwenxiz@163.com
  organization: Northeastern University,Shenyang,China
– sequence: 3
  givenname: Yibo
  surname: Wang
  fullname: Wang, Yibo
  email: yibowangcz@outlook.com
  organization: Northeastern University,Shenyang,China
– sequence: 4
  givenname: Haoye
  surname: Tian
  fullname: Tian, Haoye
  email: haoye.tian@uni.lu
  organization: University of Luxembourg,Luxembourg,Luxembourg
– sequence: 5
  givenname: Ying
  surname: Wang
  fullname: Wang, Ying
  email: wangying@swc.neu.edu.cn
  organization: Northeastern University,Shenyang,China
– sequence: 6
  givenname: Shing-Chi
  surname: Cheung
  fullname: Cheung, Shing-Chi
  email: scc@cse.ust.hk
  organization: The Hong Kong University of Science and Technology,Hong Kong,China
– sequence: 7
  givenname: Jeff
  surname: Kramer
  fullname: Kramer, Jeff
  email: j.kramer@imperial.ac.uk
  organization: Imperial College London,London,United Kingdom
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/ASE56229.2023.00089
Discipline Computer Science
EISBN 9798350329964
EISSN 2643-1572
EndPage 26
ExternalDocumentID 10298538
Genre orig-research
GrantInformation_xml – fundername: National Science Foundation of China
  grantid: 61932021,62141210
  funderid: 10.13039/501100001809
– fundername: Fundamental Research Funds for the Central Universities
  grantid: N2217005
  funderid: 10.13039/501100012226
– fundername: State Key Lab. for Novel Software Technology
  funderid: 10.13039/501100011246
– fundername: Nanjing University
  grantid: KFKT2021B01
  funderid: 10.13039/501100008048
ISICitedReferencesCount 32
IsPeerReviewed false
IsScholarly true
PageCount 13
PublicationDate 2023-Sept.-11
PublicationTitle IEEE/ACM International Conference on Automated Software Engineering : [proceedings]
PublicationTitleAbbrev ASE
PublicationYear 2023
Publisher IEEE
StartPage 14
SubjectTerms Benchmark testing
Chatbots
Codes
failure-inducing test cases
large language models
program generation
program intention inference
Programming
Software
Task analysis
Test pattern generators
URI https://ieeexplore.ieee.org/document/10298538