COMEX: A Tool for Generating Customized Source Code Representations

Learning effective representations of source code is critical for any Machine Learning for Software Engineering (ML4SE) system. Inspired by natural language processing, large language models (LLMs) like Codex and CodeGen treat code as generic sequences of text and are trained on huge corpora of code...

Bibliographic Details
Published in: IEEE/ACM International Conference on Automated Software Engineering : [proceedings], pp. 2054-2057
Main Authors: Das, Debeshee; Mathews, Noble Saji; Mathai, Alex; Tamilselvam, Srikanth; Sedamaki, Kranthi; Chimalakonda, Sridhar; Kumar, Atul
Format: Conference Proceeding
Language: English
Published: IEEE, 11 September 2023
Subjects:
ISSN: 2643-1572
Abstract Learning effective representations of source code is critical for any Machine Learning for Software Engineering (ML4SE) system. Inspired by natural language processing, large language models (LLMs) like Codex and CodeGen treat code as generic sequences of text and are trained on huge corpora of code data, achieving state-of-the-art performance on several software engineering (SE) tasks. However, valid source code, unlike natural language, follows a strict structure and pattern governed by the underlying grammar of the programming language. Current LLMs do not exploit this property: they treat code as a sequence of tokens and overlook key structural and semantic properties that can be extracted from code-views such as the Control Flow Graph (CFG), Data Flow Graph (DFG), and Abstract Syntax Tree (AST). Unfortunately, generating and integrating code-views for every programming language is cumbersome and time-consuming. To overcome this barrier, we propose COMEX, a framework that allows researchers and developers to create and combine multiple code-views that machine learning (ML) models can use for various SE tasks. Some salient features of our tool are: (i) it works directly on source code (which need not be compilable), (ii) it currently supports Java and C#, (iii) it can analyze both method-level and program-level snippets by using both intra-procedural and inter-procedural analysis, and (iv) it is easily extendable to other languages because it is built on tree-sitter, a widely used incremental parser that supports over 40 languages. We believe this easy-to-use code-view generation and customization tool will give impetus to research in source code representation learning methods and ML4SE. The source code and demonstration of our tool can be found at https://github.com/IBM/tree-sitter-codeviews and https://youtu.be/GER6U87FVbU, respectively.
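To make the idea of combining code-views concrete, here is a minimal sketch that overlays a naive data-flow view on an AST view in a single edge list. It is not COMEX's API and makes several loud assumptions: it parses Python with the standard-library `ast` module instead of parsing Java or C# with tree-sitter, the function name `combined_view` and the edge labels `"AST"` and `"DFG"` are hypothetical, and the data-flow rule (link each variable use to its most recent assignment) only holds for straight-line code without branches or loops.

```python
import ast


def combined_view(source):
    """Return (labels, edges) where edges mix AST and naive data-flow links."""
    labels = []    # node id -> human-readable label
    edges = []     # (source id, target id, view name)
    last_def = {}  # variable name -> node id of its latest assignment

    def visit(node, parent):
        node_id = len(labels)  # fresh id per visit, even for shared nodes
        label = type(node).__name__
        if isinstance(node, ast.Name):
            label += ":" + node.id
        labels.append(label)
        if parent is not None:
            edges.append((parent, node_id, "AST"))

        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load) and node.id in last_def:
                # A use: link it to the most recent definition of the name.
                edges.append((last_def[node.id], node_id, "DFG"))
            elif isinstance(node.ctx, ast.Store):
                last_def[node.id] = node_id

        if isinstance(node, ast.Assign):
            # Visit the right-hand side first so `x = x + 1` reads the old x.
            children = [node.value] + node.targets
        else:
            # Skip Load/Store context markers; they carry no structure.
            children = [c for c in ast.iter_child_nodes(node)
                        if not isinstance(c, ast.expr_context)]
        for child in children:
            visit(child, node_id)

    visit(ast.parse(source), None)
    return labels, edges


if __name__ == "__main__":
    labels, edges = combined_view("x = 1\ny = x + 2\nx = x + y\n")
    for src, dst, view in edges:
        print(f"{view}: {labels[src]} -> {labels[dst]}")
```

Keeping both views in one labeled edge list means a downstream model or graph library can filter by view name or consume the union, which is the kind of customization the abstract describes.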
Author Mathai, Alex
Sedamaki, Kranthi
Kumar, Atul
Mathews, Noble Saji
Chimalakonda, Sridhar
Das, Debeshee
Tamilselvam, Srikanth
Author_xml – sequence: 1
  givenname: Debeshee
  surname: Das
  fullname: Das, Debeshee
  email: debesheedas@gmail.com
  organization: Indian Institute of Technology Tirupati, India
– sequence: 2
  givenname: Noble Saji
  surname: Mathews
  fullname: Mathews, Noble Saji
  email: elbonleon@gmail.com
  organization: Indian Institute of Technology Tirupati, India
– sequence: 3
  givenname: Alex
  surname: Mathai
  fullname: Mathai, Alex
  email: alexmathai98@gmail.com
  organization: IBM Research, India
– sequence: 4
  givenname: Srikanth
  surname: Tamilselvam
  fullname: Tamilselvam, Srikanth
  email: srikanthtamilselvam@gmail.com
  organization: IBM Research, India
– sequence: 5
  givenname: Kranthi
  surname: Sedamaki
  fullname: Sedamaki, Kranthi
  email: skranthi4444@gmail.com
  organization: Indian Institute of Technology Tirupati, India
– sequence: 6
  givenname: Sridhar
  surname: Chimalakonda
  fullname: Chimalakonda, Sridhar
  email: sridhar.chimalakonda@gmail.com
  organization: Indian Institute of Technology Tirupati, India
– sequence: 7
  givenname: Atul
  surname: Kumar
  fullname: Kumar, Atul
  email: atulkumar@gmail.com
  organization: IBM Research, India
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/ASE56229.2023.00010
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISBN 9798350329964
EISSN 2643-1572
EndPage 2057
ExternalDocumentID 10298568
Genre orig-research
GroupedDBID 6IE
6IF
6IH
6IK
6IL
6IM
6IN
6J9
AAJGR
AAWTH
ABLEC
ACREN
ADYOE
ADZIZ
AFYQB
ALMA_UNASSIGNED_HOLDINGS
AMTXH
BEFXN
BFFAM
BGNUA
BKEBE
BPEOZ
CBEJK
CHZPO
IEGSK
IPLJI
M43
OCL
RIE
RIL
ID FETCH-LOGICAL-a284t-e6e63bd7b389c9f9ca234b05b637a521b1ed3f4e8f15d0f0bd00057a7f1b1ded3
IEDL.DBID RIE
ISICitedReferencesCount 1
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001103357200196&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
IngestDate Wed Aug 27 02:32:41 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-a284t-e6e63bd7b389c9f9ca234b05b637a521b1ed3f4e8f15d0f0bd00057a7f1b1ded3
PageCount 4
ParticipantIDs ieee_primary_10298568
PublicationCentury 2000
PublicationDate 2023-Sept.-11
PublicationDateYYYYMMDD 2023-09-11
PublicationDate_xml – month: 09
  year: 2023
  text: 2023-Sept.-11
  day: 11
PublicationDecade 2020
PublicationTitle IEEE/ACM International Conference on Automated Software Engineering : [proceedings]
PublicationTitleAbbrev ASE
PublicationYear 2023
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0051577
ssib057256115
Score 2.2783048
SourceID ieee
SourceType Publisher
StartPage 2054
SubjectTerms Codes
Computer languages
Process control
Representation learning
Semantics
Source coding
Static Analysis
Syntactics
Title COMEX: A Tool for Generating Customized Source Code Representations
URI https://ieeexplore.ieee.org/document/10298568
WOSCitedRecordID wos001103357200196&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText 1
inHoldings 1