LLM4TDG: test-driven generation of large language models based on enhanced constraint reasoning.

Bibliographic Details
Title: LLM4TDG: test-driven generation of large language models based on enhanced constraint reasoning.
Authors: Liu, Jingqiang, Liang, Ruigang, Zhu, Xiaoxi, Zhang, Yue, Liu, Yuling, Liu, Qixu
Source: Cybersecurity (2523-3246); 5/15/2025, Vol. 8 Issue 1, p1-23, 23p
Subject Terms: LANGUAGE models, COMPUTER software testing, COMPUTER software reusability, EXPERIMENTAL design, CONSTRAINT satisfaction, CODE generators
Abstract: With the evolution of modern software development paradigms, component reuse and low-code approaches have emerged as mainstream in software development. However, developers often lack an in-depth understanding of reused code. The inability of components to operate autonomously leads to insufficient testing of software functionalities and security, further exacerbating the contradiction between the increasing complexity of software architectures and the demand for accurate and efficient software automation testing. This, in turn, increases the frequency of software supply chain security incidents. This paper proposes a test-driven generation framework, LLM4TDG, based on large language models (LLMs). By formally defining the constraint dependency graph and converting it into context constraints, LLMs' ability to understand natural language descriptions such as test requirements and documents is enhanced. Constraint reasoning and backtracking mechanisms are then used to automatically generate test drivers that satisfy the defined constraints. Using the EvalPlus dataset, we evaluate the comprehensive capabilities of LLM4TDG in test case generation using four general-domain LLMs and five code-generation-domain LLMs. The experimental results indicate that our approach significantly enhances LLMs' ability to comprehend constraints in testing objectives, achieving a 47.62% increase in constraint understanding across 147 testing tasks. Employing LLM4TDG improves the average pass@k metric of all LLMs by 10.41%; the pass@k metric for CodeQwen-chat improves by up to 18.66%. The metric surpasses the state-of-the-art GPT-4, with a performance of 92.16% on HUMANEVAL and 87.14% on HUMANEVAL+, which enhances error correction and functional correctness in test-driven code generation.
Meanwhile, our experiments were conducted on a dataset of Python third-party libraries containing malicious behavior in the context of security testing tasks, validating the effectiveness of our method in real-world applications and its generalization capabilities. [ABSTRACT FROM AUTHOR]
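Note: The abstract describes constraint reasoning with backtracking only at a high level; the following minimal Python sketch illustrates the general backtracking-search technique over test-driver argument constraints. It is not the authors' LLM4TDG implementation, and all names and the example constraints are hypothetical.

```python
# Hypothetical sketch of backtracking constraint search (NOT the LLM4TDG
# implementation, whose details are not given in this abstract).

def backtrack(variables, domains, constraints, assignment=None):
    """Assign each variable a value from its domain so that every
    applicable constraint holds; return one satisfying assignment,
    or None if no assignment exists."""
    if assignment is None:
        assignment = {}
    if len(assignment) == len(variables):
        return dict(assignment)
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        # Check only constraints whose variables are all assigned so far.
        ok = all(pred(assignment) for scope, pred in constraints
                 if scope <= assignment.keys())
        if ok:
            result = backtrack(variables, domains, constraints, assignment)
            if result is not None:
                return result
        del assignment[var]  # backtrack: undo the tentative choice
    return None

# Hypothetical example: choose arguments for a call under test where the
# declared size must be positive and the length must not exceed it.
variables = ["size", "length"]
domains = {"size": [0, 4, 8], "length": [2, 8, 16]}
constraints = [
    ({"size"}, lambda a: a["size"] > 0),
    ({"size", "length"}, lambda a: a["length"] <= a["size"]),
]
print(backtrack(variables, domains, constraints))
# → {'size': 4, 'length': 2}
```

A test-driver generator could plug such an assignment into a call template; infeasible constraint sets simply return None, triggering a retry with relaxed constraints.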
Copyright of Cybersecurity (2523-3246) is the property of Springer Nature and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Complementary Index
ISSN: 2523-3246
DOI: 10.1186/s42400-024-00335-4