Sifting through the Chaff: On Utilizing Execution Feedback for Ranking the Generated Code Candidates


Detailed bibliography
Published in: IEEE/ACM International Conference on Automated Software Engineering: [proceedings], pp. 229-241
Main authors: Sun, Zhihong; Wan, Yao; Li, Jia; Zhang, Hongyu; Jin, Zhi; Li, Ge; Lyu, Chen
Format: Conference paper
Language: English
Published: ACM, 27 October 2024
ISSN: 2643-1572
Description
Summary: Large Language Models (LLMs), such as GPT-4, StarCoder, and Code Llama, are transforming the way developers approach programming by automatically generating code based on given contexts, such as natural language descriptions or incomplete surrounding code. Despite advancements, generating syntactically and semantically correct code remains challenging, especially for complex programming tasks. Existing approaches typically generate multiple candidate solutions using LLMs to increase the likelihood of producing correct code. However, selecting the correct code from these candidates - a process known as code ranking - remains a major challenge. Current research on code ranking can be categorized into execution-based and non-execution-based methods. Execution-based methods, although effective, encounter notable limitations, such as the scarcity of quality unit tests and security risks. Non-execution-based methods like CodeRanker, which rely solely on classification labels to train a code ranker, struggle to capture subtle errors and provide detailed error insights. Recognizing the strengths and limitations of both approaches, we propose a new method that integrates the advantages of execution-based and non-execution-based techniques. The key insight of our work is that an effective code ranker must truly comprehend the underlying causes of erroneous code, as relying solely on classification labels is insufficient. Inspired by this, this paper puts forward RankEF, an innovative approach for code ranking that leverages execution feedback. RankEF employs multi-task learning to integrate code classification with execution feedback generation. This approach enables the model to understand the reasons behind incorrect code, distinguishing between correct and incorrect solutions without the need to execute the code during the ranking phase. Experiments on three code generation benchmarks (APPS, MBPP, and HumanEval) demonstrate that RankEF significantly outperforms the state-of-the-art CodeRanker, achieving relative improvements of +30.97%, +31.43%, and +19.51% in Pass@1, Pass@2, and Pass@5 on the APPS test set, respectively.
CCS Concepts: • Software and its engineering → Automatic programming.
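The abstract describes RankEF only at a high level. The minimal sketch below illustrates what one multi-task training step combining correctness classification with execution-feedback generation could look like, assuming a CodeT5-style seq2seq backbone and that both tasks are cast as text generation. The checkpoint, prompts, label strings, and loss weighting are illustrative assumptions, not the authors' released implementation.

# Minimal multi-task training sketch (illustrative only, not the authors' code).
# Assumes a CodeT5 seq2seq backbone; both tasks are framed as text generation.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

def seq2seq_loss(source: str, target: str) -> torch.Tensor:
    """Cross-entropy loss for generating `target` from `source`."""
    inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)
    labels = tokenizer(target, return_tensors="pt", truncation=True, max_length=128).input_ids
    return model(**inputs, labels=labels).loss

problem = "Return the sum of a list of integers."
candidate = "def solve(xs):\n    return sum(xs)"
feedback = "Passed: all sample unit tests succeeded."   # execution feedback collected at training time
is_correct = "correct"                                   # classification label expressed as text

# Task 1: classify the candidate as correct / incorrect.
cls_loss = seq2seq_loss(f"classify: {problem}\n{candidate}", is_correct)
# Task 2: generate the execution feedback observed when running the candidate.
fb_loss = seq2seq_loss(f"feedback: {problem}\n{candidate}", feedback)

# Multi-task objective: a (hypothetical) weighted sum of both losses.
loss = cls_loss + 0.5 * fb_loss
loss.backward()

At ranking time, as the abstract notes, only the classification output would be needed to score and order candidates, so no candidate code has to be executed.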
DOI: 10.1145/3691620.3695000