Beyond Syntax: How Do LLMs Understand Code?

Within software engineering research, Large Language Models (LLMs) are often treated as 'black boxes', with only their inputs and outputs being considered. In this paper, we take a machine interpretability approach to examine how LLMs internally represent and process code. We focus on varia...


Bibliographic Details
Published in: IEEE/ACM International Conference on Software Engineering: New Ideas and Emerging Technologies Results (Online), pp. 86-90
Main Authors: North, Marc, Atapour-Abarghouei, Amir, Bencomo, Nelly
Format: Conference Proceeding
Language: English
Published: IEEE 27.04.2025
Subjects:
ISSN:2832-7632
Abstract Within software engineering research, Large Language Models (LLMs) are often treated as 'black boxes', with only their inputs and outputs being considered. In this paper, we take a machine interpretability approach to examine how LLMs internally represent and process code. We focus on variable declaration and function scope, training classifier probes on the residual streams of LLMs as they process code written in different programming languages to explore how LLMs internally represent these concepts across different programming languages. We also look for specific attention heads that support these representations and examine how they behave for inputs of different languages. Our results show that LLMs have an understanding - and internal representation - of language-independent coding semantics that goes beyond the syntax of any specific programming language, using the same internal components to process code, regardless of the programming language that the code is written in. Furthermore, we find evidence that these language-independent semantic components exist in the middle layers of LLMs and are supported by language-specific components in the earlier layers that parse the syntax of specific languages and feed into these later semantic components. Finally, we discuss the broader implications of our work, particularly in relation to concerns that AI, with its reliance on large datasets to learn new programming languages, might limit innovation in programming language design. By demonstrating that LLMs have a language-independent representation of code, we argue that LLMs may be able to flexibly learn the syntax of new programming languages while retaining their semantic understanding of universal coding concepts. In doing so, LLMs could promote creativity in future programming language design, providing tools that augment rather than constrain the future of software engineering.
Author_xml – sequence: 1
  givenname: Marc
  surname: North
  fullname: North, Marc
  email: marc.north@durham.ac.uk
  organization: Durham University, CS, Durham, UK
– sequence: 2
  givenname: Amir
  surname: Atapour-Abarghouei
  fullname: Atapour-Abarghouei, Amir
  email: amir.atapour-abarghouei@durham.ac.uk
  organization: Durham University, CS, Durham, UK
– sequence: 3
  givenname: Nelly
  surname: Bencomo
  fullname: Bencomo, Nelly
  email: nelly.bencomo@durham.ac.uk
  organization: Durham University, CS, Durham, UK
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/ICSE-NIER66352.2025.00023
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
EISBN 9798331537111
EISSN 2832-7632
EndPage 90
ExternalDocumentID 11023969
Genre orig-research
ISICitedReferencesCount 0
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed false
IsScholarly true
Language English
OpenAccessLink https://doi.org/10.1109/ICSE-NIER66352.2025.00023
PageCount 5
ParticipantIDs ieee_primary_11023969
PublicationCentury 2000
PublicationDate 2025-April-27
PublicationDateYYYYMMDD 2025-04-27
PublicationDate_xml – month: 04
  year: 2025
  text: 2025-April-27
  day: 27
PublicationDecade 2020
PublicationTitle IEEE/ACM International Conference on Software Engineering: New Ideas and Emerging Technologies Results (Online)
PublicationTitleAbbrev ICSE-NIER
PublicationYear 2025
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 86
SubjectTerms Codes
Computer languages
Encoding
Large language models
Large Language Models (LLMs)
Mechanistic interpretability
Semantics
Software engineering
Streams
Syntactics
Technological innovation
Training
Title Beyond Syntax: How Do LLMs Understand Code?
URI https://ieeexplore.ieee.org/document/11023969