Design pattern recognition: a study of large language models

Saved in:
Detailed bibliography
Title: Design pattern recognition: a study of large language models
Authors: Kumar Pandey, Sushant, 1990; Chand, Sivajeet, 1998; Horkoff, Jennifer, 1980; Staron, Miroslaw, 1977; Ochodek, Miroslaw; Durisic, Darko
Source: Empirical Software Engineering, 30(3). Replication package: Design-Pattern-Recognition-using-Large-Language-Models
Subjects: Large language model, Software reengineering, Design pattern recognition, Deep learning
Description: Context: As Software Engineering (SE) practices evolve due to extensive increases in software size and complexity, the importance of tools to analyze and understand source code grows significantly. Objective: This study aims to evaluate the abilities of Large Language Models (LLMs) in identifying design patterns (DPs) in source code, which can facilitate the development of better Design Pattern Recognition (DPR) tools. We compare the effectiveness of different LLMs in capturing semantic information relevant to the DPR task. Methods: We studied Gang of Four (GoF) DPs from the P-MARt repository of curated Java projects. State-of-the-art language models, including Code2Vec, CodeBERT, CodeGPT, CodeT5, and RoBERTa, are used to generate embeddings from source code. These embeddings are then used for DPR via a k-nearest neighbors prediction. Precision, recall, and F1-score metrics are computed to evaluate performance. Results: RoBERTa is the top performer, followed by CodeGPT and CodeBERT, which showed mean F1 scores of 0.91, 0.79, and 0.77, respectively. The results show that LLMs without explicit pre-training can effectively capture semantic and syntactic information, which can be used in building better DPR tools. Conclusion: The performance of LLMs in DPR is comparable to existing state-of-the-art methods but with less effort in identifying pattern-specific rules and pre-training. Factors influencing prediction performance in Java files/programs are analyzed. These findings can advance software engineering practices and show the importance and abilities of LLMs for effective DPR in source code.
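The Methods section describes a pipeline of code embeddings fed into a k-nearest neighbors classifier, evaluated with precision, recall, and F1. A minimal sketch of such a pipeline is below; the embedding vectors are random stand-ins (in the study they would come from models such as RoBERTa or CodeBERT), and the pattern labels, dimensions, and k value are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch of an embeddings -> k-NN design-pattern-recognition pipeline.
# NOTE: embeddings here are synthetic stand-ins; a real setup would obtain
# them from a pretrained code model (e.g. RoBERTa/CodeBERT per the abstract).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)

# Hypothetical 768-dim embeddings for Java files, labeled with GoF patterns.
labels = ["Singleton", "Observer", "Factory"]  # illustrative subset
X = np.vstack([rng.normal(loc=i, scale=0.3, size=(30, 768)) for i in range(3)])
y = np.repeat(labels, 30)

# Hold out a test split, then classify embeddings by nearest neighbors.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
pred = knn.predict(X_te)

# Evaluate with the metrics named in the abstract.
p, r, f1, _ = precision_recall_fscore_support(
    y_te, pred, average="macro", zero_division=0
)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Because each class of synthetic embeddings is well separated, the classifier scores near-perfectly here; with real code embeddings the separability of pattern classes is exactly what the study measures.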
File description: electronic
Access URL: https://research.chalmers.se/publication/545321
https://research.chalmers.se/publication/545339
https://research.chalmers.se/publication/545339/file/545339_Fulltext.pdf
Database: SwePub
ISSN: 1382-3256, 1573-7616
DOI: 10.1007/s10664-025-10625-1