Beyond Syntax: How Do LLMs Understand Code?
Saved in:

| Published in: | IEEE/ACM International Conference on Software Engineering: New Ideas and Emerging Technologies Results (Online), pp. 86–90 |
|---|---|
| Main authors: | North, Marc; Atapour-Abarghouei, Amir; Bencomo, Nelly |
| Format: | Conference paper |
| Language: | English |
| Publisher: | IEEE, 27 April 2025 |
| ISSN: | 2832-7632 |
| Online access: | Get full text |
| Abstract | Within software engineering research, Large Language Models (LLMs) are often treated as 'black boxes', with only their inputs and outputs being considered. In this paper, we take a machine interpretability approach to examine how LLMs internally represent and process code. We focus on variable declaration and function scope, training classifier probes on the residual streams of LLMs as they process code written in different programming languages to explore how LLMs internally represent these concepts across different programming languages. We also look for specific attention heads that support these representations and examine how they behave for inputs of different languages. Our results show that LLMs have an understanding - and internal representation - of language-independent coding semantics that goes beyond the syntax of any specific programming language, using the same internal components to process code, regardless of the programming language that the code is written in. Furthermore, we find evidence that these language-independent semantic components exist in the middle layers of LLMs and are supported by language-specific components in the earlier layers that parse the syntax of specific languages and feed into these later semantic components. Finally, we discuss the broader implications of our work, particularly in relation to concerns that AI, with its reliance on large datasets to learn new programming languages, might limit innovation in programming language design. By demonstrating that LLMs have a language-independent representation of code, we argue that LLMs may be able to flexibly learn the syntax of new programming languages while retaining their semantic understanding of universal coding concepts. In doing so, LLMs could promote creativity in future programming language design, providing tools that augment rather than constrain the future of software engineering. |
|---|---|
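The probing methodology described in the abstract can be sketched in miniature. The following is an illustrative example only, not the paper's code: it trains a linear classifier probe on synthetic stand-ins for residual-stream activations. The paper trains such probes on real LLM activations to test whether concepts like variable declaration and function scope are linearly decodable; here `d_model`, the token count, and the planted "scope" signal are all invented for the example.

```python
# Hedged sketch of a classifier probe on (synthetic) residual-stream activations.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64    # hypothetical residual-stream width
n_tokens = 400  # hypothetical number of probed token positions

# Synthetic activations: one random unit direction carries an
# "inside a function scope" signal on top of Gaussian noise.
scope_dir = rng.normal(size=d_model)
scope_dir /= np.linalg.norm(scope_dir)
labels = rng.integers(0, 2, size=n_tokens)  # 1 = token is inside a function
acts = rng.normal(size=(n_tokens, d_model))
acts += np.outer(3.0 * (2.0 * labels - 1.0), scope_dir)

# Logistic-regression probe fitted with plain gradient descent.
w, b = np.zeros(d_model), 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))  # predicted P(in scope)
    w -= acts.T @ (p - labels) / n_tokens
    b -= np.mean(p - labels)

accuracy = np.mean(((acts @ w + b) > 0) == labels)
print(f"probe accuracy: {accuracy:.2f}")
```

High probe accuracy indicates the concept is linearly readable from the activations; the paper's cross-language finding corresponds to such probes transferring between code written in different programming languages.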
| Author | North, Marc (Durham University, UK; marc.north@durham.ac.uk); Atapour-Abarghouei, Amir (Durham University, UK; amir.atapour-abarghouei@durham.ac.uk); Bencomo, Nelly (Durham University, UK; nelly.bencomo@durham.ac.uk) |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DOI | 10.1109/ICSE-NIER66352.2025.00023 |
| EISBN | 9798331537111 |
| EISSN | 2832-7632 |
| EndPage | 90 |
| ExternalDocumentID | 11023969 |
| Genre | orig-research |
| Language | English |
| OpenAccessLink | https://doi.org/10.1109/ICSE-NIER66352.2025.00023 |
| PageCount | 5 |
| PublicationDate | 2025-April-27 |
| PublicationTitle | IEEE/ACM International Conference on Software Engineering: New Ideas and Emerging Technologies Results (Online) |
| PublicationTitleAbbrev | ICSE-NIER |
| PublicationYear | 2025 |
| Publisher | IEEE |
| StartPage | 86 |
| SubjectTerms | Codes; Computer languages; Encoding; Large language models; Large Language Models (LLMs); Mechanistic interpretability; Semantics; Software engineering; Streams; Syntactics; Technological innovation; Training |
| Title | Beyond Syntax: How Do LLMs Understand Code? |
| URI | https://ieeexplore.ieee.org/document/11023969 |