Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting
| Published in: | IEEE/ACM International Conference on Automated Software Engineering : [proceedings] pp. 14 - 26 |
|---|---|
| Main Authors: | Li, Tsz-On; Zong, Wenxi; Wang, Yibo; Tian, Haoye; Wang, Ying; Cheung, Shing-Chi; Kramer, Jeff |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | IEEE, 11.09.2023 |
| Subjects: | |
| ISSN: | 2643-1572 |
| Abstract | Automated detection of software failures is an important but challenging software engineering task. It involves finding in a vast search space the failure-inducing test cases that contain an input triggering the software fault and an oracle asserting the incorrect execution. We are motivated to study how far this outstanding challenge can be solved by recent advances in large language models (LLMs) such as ChatGPT. However, our study reveals that ChatGPT has a relatively low success rate (28.8%) in finding correct failure-inducing test cases for buggy programs. A possible conjecture is that finding failure-inducing test cases requires analyzing the subtle differences (nuances) between the tokens of a program's correct version and those for its buggy version. When these two versions have similar sets of tokens and attentions, ChatGPT is weak in distinguishing their differences. We find that ChatGPT can successfully generate failure-inducing test cases when it is guided to focus on the nuances. Our solution is inspired by an interesting observation that ChatGPT could infer the intended functionality of buggy code if it is similar to the correct version. Driven by the inspiration, we develop a novel technique, called Differential Prompting, to effectively find failure-inducing test cases with the help of the compilable code synthesized by the inferred intention. Prompts are constructed based on the nuances between the given version and the synthesized code. We evaluate Differential Prompting on QuixBugs (a popular benchmark of buggy programs) and recent programs published at Codeforces (a popular programming contest portal, which is also an official benchmark of ChatGPT). We compare Differential Prompting with two baselines constructed using conventional ChatGPT prompting and Pynguin (the state-of-the-art unit test generation tool for Python programs). Our evaluation results show that for programs of QuixBugs, Differential Prompting can achieve a success rate of 75.0% in finding failure-inducing test cases, outperforming the best baseline by 2.6X. For programs of Codeforces, Differential Prompting's success rate is 66.7%, outperforming the best baseline by 4.0X. |
|---|---|
| Author | Cheung, Shing-Chi; Tian, Haoye; Wang, Ying; Zong, Wenxi; Li, Tsz-On; Wang, Yibo; Kramer, Jeff |
| Author_xml | 1. Li, Tsz-On (toli@cse.ust.hk), The Hong Kong University of Science and Technology, Hong Kong, China; 2. Zong, Wenxi (iamwenxiz@163.com), Northeastern University, Shenyang, China; 3. Wang, Yibo (yibowangcz@outlook.com), Northeastern University, Shenyang, China; 4. Tian, Haoye (haoye.tian@uni.lu), University of Luxembourg, Luxembourg, Luxembourg; 5. Wang, Ying (wangying@swc.neu.edu.cn), The Hong Kong University of Science and Technology, Hong Kong, China; 6. Cheung, Shing-Chi (scc@cse.ust.hk), The Hong Kong University of Science and Technology, Hong Kong, China; 7. Kramer, Jeff (j.kramer@imperial.ac.uk), Imperial College London, London, United Kingdom |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DOI | 10.1109/ASE56229.2023.00089 |
| Discipline | Computer Science |
| EISBN | 9798350329964 |
| EISSN | 2643-1572 |
| EndPage | 26 |
| ExternalDocumentID | 10298538 |
| Genre | orig-research |
| GrantInformation_xml | National Science Foundation of China (grant IDs: 61932021, 62141210; funder ID: 10.13039/501100001809); Fundamental Research Funds for the Central Universities (grant ID: N2217005; funder ID: 10.13039/501100012226); State Key Lab. for Novel Software Technology (funder ID: 10.13039/501100011246); Nanjing University (grant ID: KFKT2021B01; funder ID: 10.13039/501100008048) |
| ISICitedReferencesCount | 32 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| PageCount | 13 |
| PublicationCentury | 2000 |
| PublicationDate | 2023-Sept.-11 |
| PublicationDateYYYYMMDD | 2023-09-11 |
| PublicationDate_xml | – month: 09 year: 2023 text: 2023-Sept.-11 day: 11 |
| PublicationDecade | 2020 |
| PublicationTitle | IEEE/ACM International Conference on Automated Software Engineering : [proceedings] |
| PublicationTitleAbbrev | ASE |
| PublicationYear | 2023 |
| Publisher | IEEE |
| StartPage | 14 |
| SubjectTerms | Benchmark testing; Chatbots; Codes; failure-inducing test cases; large language models; program generation; program intention inference; Programming; Software; Task analysis; Test pattern generators |
| Title | Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting |
| URI | https://ieeexplore.ieee.org/document/10298538 |
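
The abstract describes finding failure-inducing tests by comparing the given program against code synthesized from its inferred intention. The following is a minimal sketch of that comparison step only, not the authors' implementation. It assumes the reference versions and candidate inputs have already been produced (in the paper, by ChatGPT); here they are hard-coded, hypothetical stand-ins. The names `run_safely`, `find_failure_inducing_test`, `buggy_max`, and `reference_b` are illustrative and do not come from the paper.

```python
from typing import Any, Callable, Iterable, Optional, Sequence, Tuple

Program = Callable[[Any], Any]


def run_safely(program: Program, test_input: Any) -> Tuple[str, Any]:
    """Run a program and capture either its result or the exception type."""
    try:
        return ("ok", program(test_input))
    except Exception as exc:  # a crash is also an observable behaviour
        return ("error", type(exc).__name__)


def find_failure_inducing_test(
    program_under_test: Program,
    reference_versions: Sequence[Program],
    candidate_inputs: Iterable[Any],
) -> Optional[Tuple[Any, Any, Any]]:
    """Return (input, expected, actual) for the first input on which the
    program under test deviates from the agreed reference output, else None."""
    for test_input in candidate_inputs:
        outcomes = [run_safely(ref, test_input) for ref in reference_versions]
        # Assumption: an input is only usable as a test if every reference
        # version agrees on a normal (non-crashing) result for it.
        if not outcomes or any(o != outcomes[0] for o in outcomes):
            continue
        status, expected = outcomes[0]
        if status != "ok":
            continue
        actual = run_safely(program_under_test, test_input)
        if actual != ("ok", expected):
            # The input triggers the fault; `expected` serves as the oracle.
            return (test_input, expected, actual[1])
    return None


# Hypothetical buggy program: wrong initial value for all-negative lists.
def buggy_max(values):
    best = 0
    for v in values:
        if v > best:
            best = v
    return best


# Hypothetical stand-ins for reference versions synthesized from the intention.
def reference_b(values):
    return sorted(values)[-1]


result = find_failure_inducing_test(
    buggy_max, [max, reference_b], [[1, 2, 3], [5], [-4, -2, -9]]
)
print(result)
```

In this sketch, a returned tuple pairs the triggering input with the output the references agree on, which matches the abstract's definition of a failure-inducing test case as an input plus an oracle. Requiring agreement among several reference versions is a design choice of the sketch to reduce the risk of a wrong oracle.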