Comprehending C codes with LLMs: Effective comment generation through retrieval and reasoning.

Saved in:
Detailed bibliography
Title: Comprehending C codes with LLMs: Effective comment generation through retrieval and reasoning.
Authors: Majumdar, Srijoni1,2 (AUTHOR) s.majumdar@leeds.ac.uk, Deshpande, Adwita3 (AUTHOR) adwita.deshpande.22033@iitgoa.ac.in, Das, Partha Pratim4 (AUTHOR) partha.das@ashoka.edu.in, Chakrabarti, Partha Pratim2 (AUTHOR) ppchak@cse.iitkgp.ac.in
Source: Pattern Recognition Letters. Jan 2026, Vol. 199, p295-302. 8p.
Subjects: *C (Computer program language), *LANGUAGE models, *EVALUATION methodology, *SOFTWARE maintenance, *ANNOTATIONS, *NATURAL language processing, *MACHINE learning
Abstract: Software maintenance requires substantial time for program comprehension. Code comments significantly improve understandability by providing a glass-box view of the code and are thus essential for maintainability. Prior work has analyzed comment attributes, built automated systems to detect irrelevant comments, and applied machine learning to generate meaningful comments. With the rise of large language models, comment generation has accelerated, particularly for Java and Python. In this paper, we present a first-of-its-kind framework for code comment generation in C, a language widely used in low-level tasks. We explore the effectiveness of few-shot learning, retrieval-augmented generation, and code-structure-based context modeling. Our work builds on prior field studies conducted across seven companies in India and the UK, resulting in a dataset of 20,206 human-annotated C comments rated for usefulness. By 2024, contributions from 40 academic teams and 50 hackathon groups expanded this dataset to 24,578 comments. We further introduce a reusable evaluation framework involving human experts and large language model evaluators, grounded in eight dimensions derived from four industry case studies. A subset of 11,797 comments has been annotated for the presence or absence of these dimensions, serving as both input for generation and evaluation. Our results show that GPT-4o mini-trained models produce comments most aligned with human-annotated ones, achieving a similarity score of 0.64, followed by Gemini 1.5 at 0.58. GPT-4.5 achieves the highest alignment with humans as an evaluator, while Llama-3.1-70b performs the lowest. • Generic RAG and source-code-based architecture for comment generation in C. • Evaluation with human and LLM critics for assessing and improving generated comments. • 11.7K code comments with annotated categories relevant for code comprehension. [ABSTRACT FROM AUTHOR]
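The abstract reports alignment between generated and human-annotated comments as a similarity score (0.64 for GPT-4o mini, 0.58 for Gemini 1.5). The record does not specify how that score is computed; a common choice is cosine similarity over comment embeddings. The sketch below is a hypothetical illustration of that idea, using a toy bag-of-words vector in place of a real embedding model (which this record does not name).

```python
# Hypothetical sketch: scoring a generated comment against a human-written one
# with cosine similarity, one plausible way to obtain alignment scores like
# those in the abstract. The bag-of-words "embedding" is a stand-in for
# whatever embedding model the paper actually uses (not specified here).
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase token counts (illustrative only)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[tok] * b[tok] for tok in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Example comment pair (invented for illustration, not from the dataset).
generated = "frees the linked list and resets the head pointer"
human = "releases all nodes of the linked list and resets head"
score = cosine_similarity(embed(generated), embed(human))
print(f"similarity: {score:.2f}")
```

With a real sentence-embedding model in place of `embed`, the same scoring loop would yield corpus-level averages comparable to the 0.64 and 0.58 figures quoted above.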
Database: Academic Search Index
ISSN: 0167-8655
DOI: 10.1016/j.patrec.2025.10.007