Literate Programming with LLMs? - A Study on Rosetta Code and CodeNet

Detailed bibliography
Title: Literate Programming with LLMs? - A Study on Rosetta Code and CodeNet
Authors: Sun, Simin, 1993; Staron, Miroslaw, 1977
Source: IEEE Transactions on Software Engineering. In Press
Subjects: Code-related Tasks, Literate Programming, Computation Experiment, Large Language Model (LLM)
Description: Literate programming, a concept introduced by Knuth in 1984, emphasizes combining human-readable documentation with machine-readable code, as writing literate programs is a prerequisite for software quality. Our objective in this paper is to evaluate whether generative AI models, i.e., Large Language Models (LLMs) such as GPT-4, LLaMA, or Falcon, are capable of literate programming, given their extensive use in software engineering. To truly achieve literate programming, LLMs must generate natural language descriptions and corresponding code with aligned semantics based on user prompts. In addition, their internal representation of programs should allow us to recognize both programming languages and their descriptions. To evaluate their capabilities, we conducted a study using the Rosetta Code and CodeNet repositories. We performed four computational experiments on the Rosetta Code repository, encompassing 1,228 tasks across 926 programming languages, and validated our findings on the larger CodeNet dataset, which includes 55 tasks and 52 languages. Our findings show that LLMs in the trillion-parameter class are capable of literate programming, while models in the million- and billion-parameter classes are better at recognizing programming languages than tasks. Based on these results, we conclude that modern LLMs exhibit a deeper ability to encode programming languages and the semantics of programming tasks, bringing us closer to realizing the full potential of literate programming.
File description: electronic
Access URLs: https://research.chalmers.se/publication/549267
https://research.chalmers.se/publication/549267/file/549267_Fulltext.pdf
Database: SwePub
ISSN: 0098-5589, 1939-3520
DOI: 10.1109/TSE.2025.3629828
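
The probing idea in the abstract, checking whether an LLM's internal representation of a program lets a simple classifier recover both the programming language and the task, can be pictured with a short sketch. The snippet below is a minimal illustration under assumed tooling: the embedding model, the toy corpus, and the logistic-regression probe are this note's assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of a representation probe: embed code samples with a generic
# embedding model, then see how well a simple linear classifier recovers
# (a) the programming language and (b) the task. The model name, toy corpus,
# and probe are illustrative assumptions, not the paper's code.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy corpus: (source code, language label, task label).
samples = [
    ('print(sum(range(1, 11)))', "Python", "sum-1-to-10"),
    ('puts (1..10).sum',         "Ruby",   "sum-1-to-10"),
    ('print("Hello, World!")',   "Python", "hello-world"),
    ('puts "Hello, World!"',     "Ruby",   "hello-world"),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
X = model.encode([code for code, _, _ in samples])

for name, labels in [("language", [s[1] for s in samples]),
                     ("task",     [s[2] for s in samples])]:
    probe = LogisticRegression(max_iter=1000)
    acc = cross_val_score(probe, X, labels, cv=2).mean()
    print(f"{name} probe accuracy: {acc:.2f}")
```

On a real corpus such as Rosetta Code, the gap between the two probe accuracies would indicate whether the representation encodes languages better than tasks, which is the kind of asymmetry the abstract reports for million- and billion-parameter models.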