Assessing small language models for code generation: An empirical study with benchmarks.
| Title: | Assessing small language models for code generation: An empirical study with benchmarks. |
|---|---|
| Authors: | Mahade Hasan, Md (mdmahade.hasan@tuni.fi); Waseem, Muhammad (muhammad.waseem@tuni.fi); Kemell, Kai-Kristian (kai-kristian.kemell@tuni.fi); Rasku, Jussi (jussi.rasku@tuni.fi); Ala-Rantala, Juha (juha.ala-rantala@tuni.fi); Abrahamsson, Pekka (pekka.abrahamsson@tuni.fi) |
| Source: | Journal of Systems & Software, Jun 2026, Vol. 236. |
| Subject Terms: | *Programming languages; Code generators; Benchmark problems (computer science); Computer performance; Language models |
| Abstract: | • Evaluated 20 open-source Small Language Models (SLMs) on five code generation benchmarks, covering tasks such as code synthesis, summarization, and repair. • Identified that several compact SLMs (<3B parameters) achieve competitive performance with lower resource requirements, making them suitable for deployment in memory-constrained environments. • Larger SLMs offer higher accuracy but demand up to 4 times more resources (VRAM) for a 10% improvement in pass@1, highlighting clear performance-efficiency trade-offs of SLMs. • No statistically significant performance differences across programming languages, suggesting SLMs' generalizability in multilingual code generation tasks within the programming languages tested in the study. Recent advancements in Small Language Models (SLMs) have opened new possibilities for efficient code generation. SLMs offer lightweight and cost-effective alternatives to Large Language Models (LLMs), making them attractive for use in resource-constrained environments. However, empirical understanding of SLMs, particularly their capabilities, limitations, and performance trade-offs in code generation, remains limited. This study presents a comprehensive empirical evaluation of 20 open-source SLMs ranging from 0.4B to 10B parameters on five diverse code-related benchmarks (HumanEval, MBPP, Mercury, HumanEvalPack, and CodeXGLUE). The models are assessed along three dimensions: i) functional correctness of generated code, ii) computational efficiency, and iii) performance across multiple programming languages. The findings reveal that several compact SLMs achieve competitive results while maintaining a balance between performance and efficiency, making them viable for deployment in resource-constrained environments. However, achieving further improvements in accuracy requires switching to larger models, which generally outperform their smaller counterparts but demand much more computational power. 
We observe that a 10% performance improvement can require nearly a 4x increase in VRAM consumption, highlighting a trade-off between effectiveness and scalability. Moreover, the multilingual performance analysis reveals that SLMs tend to perform better in languages such as Python, Java, and PHP, while exhibiting relatively weaker performance in Go, C++, and Ruby. However, statistical analysis suggests these differences are not significant, indicating the generalizability of SLMs across programming languages. Based on these findings, this work provides insights into the design and selection of SLMs for real-world code generation tasks. [ABSTRACT FROM AUTHOR] |
| Copyright: | Copyright of Journal of Systems & Software is the property of Elsevier B.V. and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) |
| Database: | Business Source Index |
| ISSN: | 0164-1212 |
| DOI: | 10.1016/j.jss.2026.112815 |
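The abstract reports functional correctness as pass@1. The paper's exact evaluation harness is not given in this record, but pass@k is conventionally computed with the unbiased estimator of Chen et al. (2021), where n samples are generated per task and c of them pass the unit tests; a minimal sketch under that assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one
    of k samples, drawn without replacement from n generations of
    which c are correct, passes the tests."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the expected fraction of correct generations:
print(round(pass_at_k(10, 3, 1), 2))  # → 0.3
```

Per-task scores are then averaged over the benchmark (e.g. the 164 HumanEval problems) to give the figure reported for each model.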