Multilingual training for Software Engineering


Published in: 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), pp. 1443–1455
Main authors: Ahmed, Toufique; Devanbu, Premkumar
Format: Conference paper
Language: English
Publisher: ACM, 01 May 2022
ISSN: 1558-1225
Abstract Well-trained machine-learning models, which leverage large amounts of open-source software data, have now become an interesting approach to automating many software engineering tasks. Several SE tasks have been subject to this approach, with performance gradually improving over the past several years with better models and training methods. More, and more diverse, clean, labeled data is better for training; but constructing good-quality datasets is time-consuming and challenging. Ways of augmenting the volume and diversity of clean, labeled data generally have wide applicability. For some languages (e.g., Ruby) labeled data is less abundant; in others (e.g., JavaScript) the available data may be more focused on some application domains, and thus less diverse. As a way around such data bottlenecks, we present evidence suggesting that human-written code in different languages (which performs the same function) is rather similar, and particularly preserving of identifier naming patterns; we further present evidence suggesting that identifiers are a very important element of training data for software engineering tasks. We leverage this rather fortuitous phenomenon to find evidence that available multilingual training data (across different languages) can be used to amplify performance. We study this for three different tasks: code summarization, code retrieval, and function naming. We note that this data-augmenting approach is broadly compatible with different tasks, languages, and machine-learning models.
Authors:
– Ahmed, Toufique (tfahmed@ucdavis.edu), University of California, Davis, California, USA
– Devanbu, Premkumar (ptdevanbu@ucdavis.edu), University of California, Davis, California, USA
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1145/3510003.3510049
Discipline Computer Science
EISBN 9781450392211 (ISBN-13), 1450392210 (ISBN-10)
EISSN 1558-1225
EndPage 1455
ExternalDocumentID 9794126
Genre orig-research
GrantInformation_xml – fundername: U.S. National Science Foundation
  grantid: 1414172,2107592
  funderid: 10.13039/100000001
ISICitedReferencesCount 37
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed false
IsScholarly true
Language English
OpenAccessLink https://ieeexplore.ieee.org/document/9794126
PageCount 13
PublicationDate 2022-May
PublicationTitle 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE)
PublicationTitleAbbrev ICSE
PublicationYear 2022
Publisher ACM
StartPage 1443
SubjectTerms code search
code summarization
Codes
Computer languages
deep learning
Machine learning
method name prediction
Natural languages
Syntactics
Training
Training data
Title Multilingual training for Software Engineering
URI https://ieeexplore.ieee.org/document/9794126