Language Models for Code Completion: A Practical Evaluation


Saved in:
Detailed bibliography
Published in: Proceedings / International Conference on Software Engineering, pp. 956-968
Main authors: Izadi, Maliheh, Katzy, Jonathan, van Dam, Tim, Otten, Marc, Popescu, Razvan Mihai, van Deursen, Arie
Format: Conference paper
Language: English
Published: ACM, 14 April 2024
Subjects:
ISSN: 1558-1225
Online access: Get full text
Abstract Transformer-based language models for automatic code completion have shown great promise so far, yet the evaluation of these models rarely uses real data. This study provides both quantitative and qualitative assessments of three public code language models when completing real-world code. We first developed an open-source IDE extension, Code4Me, for the online evaluation of the models. We collected real auto-completion usage data for over a year from more than 1200 users, resulting in over 600K valid completions. These models were then evaluated using six standard metrics across twelve programming languages. Next, we conducted a qualitative study of 1690 real-world completion requests to identify the reasons behind the poor model performance. A comparative analysis of the models' performance in online and offline settings was also performed, using benchmark synthetic datasets and two masking strategies. Our findings suggest that while developers utilize code completion across various languages, the best results are achieved for mainstream languages such as Python and Java. InCoder outperformed the other models across all programming languages, highlighting the significance of training data and objectives. Our study also revealed that offline evaluations do not accurately reflect real-world scenarios. Upon qualitative analysis of the models' predictions, we found that 66.3% of failures were due to models' limitations, 24.4% occurred due to inappropriate model usage in a development context, and 9.3% were valid requests that developers overwrote. Given these findings, we propose several strategies to overcome the current limitations. These include refining training objectives, improving resilience to typographical errors, adopting hybrid approaches, and enhancing implementations and usability.
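The record does not name the six evaluation metrics. As an illustration only (the specific metrics here are an assumption, not taken from this record), the sketch below computes two measures commonly reported for code-completion evaluation, exact match and edit similarity, over hypothetical prediction/ground-truth pairs.

```python
# Hedged sketch: illustrates exact match and edit similarity, two metrics
# commonly used when scoring code completions; the paper's actual six
# metrics are not listed in this record.
from difflib import SequenceMatcher


def exact_match(prediction: str, ground_truth: str) -> bool:
    """True when the completion matches the reference code exactly (whitespace-trimmed)."""
    return prediction.strip() == ground_truth.strip()


def edit_similarity(prediction: str, ground_truth: str) -> float:
    """Similarity ratio in [0, 1] based on matching subsequences of the two strings."""
    return SequenceMatcher(None, prediction, ground_truth).ratio()


if __name__ == "__main__":
    # Hypothetical (prediction, ground truth) pairs standing in for logged completions.
    pairs = [
        ("return sorted(items)", "return sorted(items)"),
        ("for i in range(len(xs)):", "for i, x in enumerate(xs):"),
    ]
    em = sum(exact_match(p, g) for p, g in pairs) / len(pairs)
    es = sum(edit_similarity(p, g) for p, g in pairs) / len(pairs)
    print(f"Exact match: {em:.2f}, edit similarity: {es:.2f}")
```

In an online setting such as the Code4Me extension described in the abstract, the ground truth would presumably be the code the developer ultimately kept after a completion was shown.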
Author Popescu, Razvan Mihai
van Deursen, Arie
Otten, Marc
Katzy, Jonathan
Izadi, Maliheh
van Dam, Tim
Author_xml – sequence: 1
  givenname: Maliheh
  surname: Izadi
  fullname: Izadi, Maliheh
  email: m.izadi@tudelft.nl
  organization: Delft University of Technology, Delft, Netherlands
– sequence: 2
  givenname: Jonathan
  surname: Katzy
  fullname: Katzy, Jonathan
  email: j.b.katzy@tudelft.nl
  organization: Delft University of Technology, Delft, Netherlands
– sequence: 3
  givenname: Tim
  surname: van Dam
  fullname: van Dam, Tim
  email: t.o.vandam@student.tudelft.nl
  organization: Delft University of Technology, Delft, Netherlands
– sequence: 4
  givenname: Marc
  surname: Otten
  fullname: Otten, Marc
  email: m.j.c.otten@student.tudelft.nl
  organization: Delft University of Technology, Delft, Netherlands
– sequence: 5
  givenname: Razvan Mihai
  surname: Popescu
  fullname: Popescu, Razvan Mihai
  email: r.popescu-3@student.tudelft.nl
  organization: Delft University of Technology, Delft, Netherlands
– sequence: 6
  givenname: Arie
  surname: van Deursen
  fullname: van Deursen, Arie
  email: arie.vandeursen@tudelft.nl
  organization: Delft University of Technology, Delft, Netherlands
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1145/3597503.3639138
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Xplore Open Access Journals
IEEE/IET Electronic Library (IEL) (UW System Shared)
IEEE Proceedings Order Plans (POP) 1998-present
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE/IET Electronic Library (IEL) (UW System Shared)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
Discipline Computer Science
EISBN 9798400702174
EISSN 1558-1225
EndPage 968
ExternalDocumentID 10548973
Genre orig-research
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed false
IsScholarly true
Language English
OpenAccessLink https://ieeexplore.ieee.org/document/10548973
PageCount 13
PublicationDate 2024-April-14
PublicationTitle Proceedings / International Conference on Software Engineering
PublicationTitleAbbrev ICSE
PublicationYear 2024
Publisher ACM
SourceID ieee
SourceType Publisher
StartPage 956
SubjectTerms Analytical models
Automatic Code Completion
CodeGPT
Codes
Data models
Evaluation
IDE
InCoder
Language Models
Open Source
Predictive models
Training
Training data
Transformers
UniXcoder
Title Language Models for Code Completion: A Practical Evaluation
URI https://ieeexplore.ieee.org/document/10548973