Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting

Bibliographic Details
Published in: IEEE/ACM International Conference on Automated Software Engineering: [proceedings], pp. 14-26
Main Authors: Li, Tsz-On, Zong, Wenxi, Wang, Yibo, Tian, Haoye, Wang, Ying, Cheung, Shing-Chi, Kramer, Jeff
Format: Conference Proceeding
Language: English
Published: IEEE 11.09.2023
Subjects:
ISSN: 2643-1572
Abstract: Automated detection of software failures is an important but challenging software engineering task. It involves finding in a vast search space the failure-inducing test cases that contain an input triggering the software fault and an oracle asserting the incorrect execution. We are motivated to study how far this outstanding challenge can be solved by recent advances in large language models (LLMs) such as ChatGPT. However, our study reveals that ChatGPT has a relatively low success rate (28.8%) in finding correct failure-inducing test cases for buggy programs. A possible conjecture is that finding failure-inducing test cases requires analyzing the subtle differences (nuances) between the tokens of a program's correct version and those of its buggy version. When these two versions have similar sets of tokens and attentions, ChatGPT is weak in distinguishing their differences. We find that ChatGPT can successfully generate failure-inducing test cases when it is guided to focus on the nuances. Our solution is inspired by an interesting observation that ChatGPT could infer the intended functionality of buggy code if it is similar to the correct version. Driven by the inspiration, we develop a novel technique, called Differential Prompting, to effectively find failure-inducing test cases with the help of the compilable code synthesized by the inferred intention. Prompts are constructed based on the nuances between the given version and the synthesized code. We evaluate Differential Prompting on Quixbugs (a popular benchmark of buggy programs) and recent programs published at Codeforces (a popular programming contest portal, which is also an official benchmark of ChatGPT). We compare Differential Prompting with two baselines constructed using conventional ChatGPT prompting and Pynguin (the state-of-the-art unit test generation tool for Python programs).
Our evaluation results show that for programs of Quixbugs, Differential Prompting can achieve a success rate of 75.0% in finding failure-inducing test cases, outperforming the best baseline by 2.6X. For programs of Codeforces, Differential Prompting's success rate is 66.7%, outperforming the best baseline by 4.0X.
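As the abstract describes, Differential Prompting first asks ChatGPT to infer the intended functionality of the buggy program, synthesizes a compilable reference version from that intention, and then treats any input on which the two versions disagree as failure-inducing, with the reference output serving as the oracle. A minimal sketch of that final differential step (the buggy/reference pair and candidate inputs below are illustrative stand-ins, not from the paper; in the technique itself the reference version and the nuance-focused test inputs would come from ChatGPT):

```python
# Sketch of the differential step: a test input is failure-inducing when
# the buggy version and the version synthesized from the inferred
# intention disagree; the synthesized output acts as the test oracle.
# Both implementations are illustrative examples, not taken from the paper.

def buggy_max_sublist_sum(nums):
    # Intended: maximum sum of any contiguous sublist (empty sublist -> 0).
    # Bug: never resets the running sum when it drops below zero.
    best = run = 0
    for x in nums:
        run += x
        best = max(best, run)
    return best

def synthesized_max_sublist_sum(nums):
    # Stand-in for the reference version synthesized from the inferred
    # intention (here, Kadane's algorithm).
    best = run = 0
    for x in nums:
        run = max(0, run + x)
        best = max(best, run)
    return best

def find_failure_inducing_test(candidate_inputs):
    """Return (input, expected_output) for the first input on which the
    two versions diverge, or None if they always agree."""
    for nums in candidate_inputs:
        expected = synthesized_max_sublist_sum(nums)
        if buggy_max_sublist_sum(nums) != expected:
            return nums, expected
    return None

# Candidate inputs are listed by hand here; in Differential Prompting
# they would be generated by prompts focused on the nuances between
# the two versions.
print(find_failure_inducing_test([[1, 2, 3], [-1, -2], [-2, 5, 1]]))
# -> ([-2, 5, 1], 6)
```

The key design point the abstract emphasizes is that the prompts themselves are built from the token-level nuances between the given buggy version and the synthesized code, which is what lifts ChatGPT's success rate from 28.8% to 75.0% on Quixbugs.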
Author_xml – sequence: 1
  givenname: Tsz-On
  surname: Li
  fullname: Li, Tsz-On
  email: toli@cse.ust.hk
  organization: The Hong Kong University of Science and Technology,Hong Kong,China
– sequence: 2
  givenname: Wenxi
  surname: Zong
  fullname: Zong, Wenxi
  email: iamwenxiz@163.com
  organization: Northeastern University,Shenyang,China
– sequence: 3
  givenname: Yibo
  surname: Wang
  fullname: Wang, Yibo
  email: yibowangcz@outlook.com
  organization: Northeastern University,Shenyang,China
– sequence: 4
  givenname: Haoye
  surname: Tian
  fullname: Tian, Haoye
  email: haoye.tian@uni.lu
  organization: University of Luxembourg,Luxembourg,Luxembourg
– sequence: 5
  givenname: Ying
  surname: Wang
  fullname: Wang, Ying
  email: wangying@swc.neu.edu.cn
  organization: Northeastern University,Shenyang,China
– sequence: 6
  givenname: Shing-Chi
  surname: Cheung
  fullname: Cheung, Shing-Chi
  email: scc@cse.ust.hk
  organization: The Hong Kong University of Science and Technology,Hong Kong,China
– sequence: 7
  givenname: Jeff
  surname: Kramer
  fullname: Kramer, Jeff
  email: j.kramer@imperial.ac.uk
  organization: Imperial College London,London,United Kingdom
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/ASE56229.2023.00089
Discipline Computer Science
EISBN 9798350329964
EISSN 2643-1572
EndPage 26
ExternalDocumentID 10298538
Genre orig-research
GrantInformation_xml – fundername: National Science Foundation of China
  grantid: 61932021,62141210
  funderid: 10.13039/501100001809
– fundername: Fundamental Research Funds for the Central Universities
  grantid: N2217005
  funderid: 10.13039/501100012226
– fundername: State Key Lab. for Novel Software Technology
  funderid: 10.13039/501100011246
– fundername: Nanjing University
  grantid: KFKT2021B01
  funderid: 10.13039/501100008048
ISICitedReferencesCount 32
IsPeerReviewed false
IsScholarly true
PageCount 13
PublicationDate 2023-Sept.-11
PublicationTitle IEEE/ACM International Conference on Automated Software Engineering : [proceedings]
PublicationTitleAbbrev ASE
PublicationYear 2023
Publisher IEEE
StartPage 14
SubjectTerms Benchmark testing
Chatbots
Codes
failure-inducing test cases
large language models
program generation
program intention inference
Programming
Software
Task analysis
Test pattern generators
URI https://ieeexplore.ieee.org/document/10298538