Do Current Language Models Support Code Intelligence for R Programming Language?

Bibliographic Details
Title: Do Current Language Models Support Code Intelligence for R Programming Language?
Authors: Zhao, Zixiao; Fard, Fatemeh
Source: ACM Transactions on Software Engineering & Methodology; Nov 2025, Vol. 34, Issue 8, pp. 1-39 (39 pages)
Subjects: LANGUAGE models, SOFTWARE engineering, ABSTRACTION (Computer science), PROGRAMMING languages
Abstract: Recent advancements in developing Pre-trained Language Models for Code (Code-PLMs) have urged many areas of Software Engineering (SE) and brought breakthrough results for many SE tasks. Though these models have achieved state-of-the-art performance on SE tasks for many popular programming languages, such as Java and Python, scientific software and its related languages, like the R programming language, have rarely benefited from or even been evaluated with Code-PLMs. Research has shown that R differs in many ways from other programming languages and requires specific techniques. In this study, we provide the first insights for code intelligence for R. For this purpose, we collect and open source an R dataset, and evaluate Code-PLMs on the two tasks of code summarization and method name prediction using several settings and strategies, including the differences between two R styles, Tidy-verse and Base R. Our results demonstrate that the studied models experience varying degrees of performance degradation when processing R code, which is supported by human evaluation. Additionally, not all models show performance improvement in R-specific tasks even after multi-language fine-tuning. The dual syntax paradigms in R significantly impact the models' performance, particularly in code summarization tasks. Furthermore, the project-specific context inherent in R codebases significantly impacts performance when attempting cross-project training. Interestingly, even when Large Language Models like CodeLlama and StarCoder2 are used for code generation, the Pass@K (K = 1, 5, 10) results lag significantly behind Python scores. Our research shows that R, as a low-resource language, requires different techniques to collect high-quality data. Specifically, separating the two R styles has a great impact on the results, and the separate dataset could increase the performance of the models.
Our research sheds light on the capabilities of Code-PLMs and opens new research directions for researchers and practitioners for developing code intelligence tools and techniques for R. With R's widespread use and popularity, the results of our study can potentially benefit a large community of R developers, both in research and industry. [ABSTRACT FROM AUTHOR]
Database: Complementary Index