Multilingual training for Software Engineering
| Published in: | 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), pp. 1443-1455 |
|---|---|
| Main authors: | Ahmed, Toufique; Devanbu, Premkumar |
| Format: | Conference paper |
| Language: | English |
| Publication details: | ACM, 1 May 2022 |
| ISSN: | 1558-1225 |
| Abstract | Well-trained machine-learning models, which leverage large amounts of open-source software data, have now become an interesting approach to automating many software engineering tasks. Several SE tasks have been subject to this approach, with performance gradually improving over the past several years with better models and training methods. More, and more diverse, clean, labeled data is better for training; but constructing good-quality datasets is time-consuming and challenging. Ways of augmenting the volume and diversity of clean, labeled data generally have wide applicability. For some languages (e.g., Ruby) labeled data is less abundant; in others (e.g., JavaScript) the available data may be more focused on some application domains, and thus less diverse. As a way around such data bottlenecks, we present evidence suggesting that human-written code that performs the same function is rather similar across languages, and in particular preserves identifier naming patterns; we further present evidence suggesting that identifiers are a very important element of training data for software engineering tasks. We leverage this rather fortuitous phenomenon to find evidence that available multilingual training data (across different languages) can be used to amplify performance. We study this for three different tasks: code summarization, code retrieval, and function naming. We note that this data-augmenting approach is broadly compatible with different tasks, languages, and machine-learning models. |
|---|---|
| Author | Ahmed, Toufique; Devanbu, Premkumar |
| Author_xml | 1. Ahmed, Toufique (tfahmed@ucdavis.edu), University of California, Davis, Davis, California, USA; 2. Devanbu, Premkumar (ptdevanbu@ucdavis.edu), University of California, Davis, Davis, California, USA |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DOI | 10.1145/3510003.3510049 |
| Discipline | Computer Science |
| EISBN | 9781450392211; 1450392210 |
| EISSN | 1558-1225 |
| EndPage | 1455 |
| ExternalDocumentID | 9794126 |
| Genre | orig-research |
| GrantInformation_xml | Funder: U.S. National Science Foundation (funder ID 10.13039/100000001); grant IDs: 1414172, 2107592 |
| ISICitedReferencesCount | 37 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| OpenAccessLink | https://ieeexplore.ieee.org/document/9794126 |
| PageCount | 13 |
| PublicationCentury | 2000 |
| PublicationDate | 2022-May |
| PublicationDateYYYYMMDD | 2022-05-01 |
| PublicationDecade | 2020 |
| PublicationTitle | 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE) |
| PublicationTitleAbbrev | ICSE |
| PublicationYear | 2022 |
| Publisher | ACM |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 1443 |
| SubjectTerms | code search; code summarization; Codes; Computer languages; deep learning; Machine learning; method name prediction; Natural languages; Syntactics; Training; Training data |
| Title | Multilingual training for Software Engineering |
| URI | https://ieeexplore.ieee.org/document/9794126 |