Revisiting Information Retrieval and Deep Learning Approaches for Code Summarization

Detailed bibliography
Published in: Proceedings (IEEE/ACM International Conference on Software Engineering Companion. Online), pp. 328-329
Main authors: Zhu, Tingwei; Li, Zhong; Pan, Minxue; Shi, Chaoxuan; Zhang, Tian; Pei, Yu; Li, Xuandong
Medium: Conference paper
Language: English
Published: IEEE, 1 May 2023
ISSN: 2574-1934
Abstract
Code summarization refers to the procedure of creating short descriptions that outline the semantics of source code snippets. Existing code summarization approaches can be broadly classified into Information Retrieval (IR)-based and Deep Learning (DL)-based approaches. However, their effectiveness, and especially their strengths and weaknesses, remain largely understudied. Existing evaluations use different benchmarks and metrics, making performance comparisons of these approaches susceptible to bias and potentially yielding misleading results. For example, the DL-based approaches typically show better code summarization performance in their original papers [1], [2]. However, Gros et al. [3] report that a naive IR approach can achieve performance comparable to (or even better than) the DL-based ones. In addition, some recent work [4], [5] suggests that incorporating IR techniques can improve the DL-based approaches. To further advance code summarization techniques, it is critical to understand how IR-based and DL-based approaches perform on different datasets and in terms of different metrics. Prior works have studied some aspects of code summarization, such as the factors affecting performance evaluation [6] and the importance of data preprocessing [7]. In this paper, we focus on studying the IR-based and DL-based code summarization approaches to enhance the understanding and design of more advanced techniques. We first compare the IR-based and DL-based approaches under the same experimental settings and benchmarks, then study their strengths and limitations through quantitative and qualitative analyses. Finally, we propose a simple yet effective strategy that combines IR and DL to further improve code summarization. Four IR-based approaches and two DL-based approaches are investigated, selected for representativeness and diversity.

For IR-based approaches, we select three BM25-based approaches (i.e., BM25-spl, BM25-ast, and BM25-alpha) and one nearest neighbor-based approach, NNGen [8], all of which are often used as baselines in prior work. They retrieve the most similar code from the database and directly output the corresponding summary. The BM25-based approaches are implemented with Lucene [9]. Taking different forms of code as input, BM25-spl splits CamelCase and snake_case identifiers in the original source code tokens, BM25-ast obtains sequence representations via pre-order Abstract Syntax Tree (AST) traversal, and BM25-alpha keeps only the alphabetic tokens in the code. For DL-based approaches, we choose the state-of-the-art pre-trained model PLBART [2] and the trained-from-scratch model SiT [1]. We adopt four widely used Java datasets, namely TLC [10], CSN [11], HDC [12], and FCM [13], as our subject datasets. TLC and HDC are method-split datasets, where methods in the same project are randomly split into training/validation/test sets. CSN and FCM are project-split datasets, where examples from the same project appear in only one partition. We further process the four datasets to build cleaner versions by removing examples that have syntax errors, empty method bodies, or overly long or short sequences, and by removing duplicate examples from the validation and test sets. To comprehensively and systematically evaluate the performance of the code summarization approaches, we adopt three widely used metrics, i.e., BLEU (both C-BLEU and S-BLEU), ROUGE, and METEOR.
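The retrieval pipeline described in the abstract (subtoken splitting followed by BM25 ranking, with the retrieved example's summary emitted verbatim) can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's Lucene-backed implementation: the toy code/summary database, the query snippet, and the BM25 parameters (k1 = 1.2, b = 0.75) are all illustrative assumptions.

```python
import math
import re
from collections import Counter

def subtokenize(code: str) -> list[str]:
    """Split identifiers on snake_case and CamelCase, as BM25-spl does."""
    words = re.findall(r"[A-Za-z]+", code)  # keep alphabetic runs only
    parts = []
    for w in words:
        for piece in w.split("_"):
            # CamelCase split: "getFileName" -> ["get", "File", "Name"]
            parts += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", piece)
    return [p.lower() for p in parts]

def bm25_scores(query, corpus, k1=1.2, b=0.75):
    """Plain BM25: score each tokenized document in `corpus` against `query`."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    df = Counter(t for d in corpus for t in set(d))  # document frequencies
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

# Toy database: (code, summary) pairs stand in for the training set.
database = [
    ("public int getFileSize(File f) { return f.length(); }",
     "returns the size of a file"),
    ("void sort_list(List<Integer> xs) { Collections.sort(xs); }",
     "sorts a list in place"),
]
corpus = [subtokenize(code) for code, _ in database]
query = subtokenize("long fileSize(Path p) { return Files.size(p); }")
scores = bm25_scores(query, corpus)
best = max(range(len(scores)), key=scores.__getitem__)
print(database[best][1])  # prints: returns the size of a file
```

Because the retrieved summary is output as-is, IR approaches of this kind succeed exactly when the database contains near-duplicate code, which is one reason method-split and project-split datasets yield such different comparisons.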
Authors and affiliations:
1. Zhu, Tingwei (Nanjing University, Nanjing, China)
2. Li, Zhong (Nanjing University, Nanjing, China)
3. Pan, Minxue (Nanjing University, Nanjing, China)
4. Shi, Chaoxuan (Nanjing University, Nanjing, China)
5. Zhang, Tian (Nanjing University, Nanjing, China)
6. Pei, Yu (The Hong Kong Polytechnic University, Hong Kong, China)
7. Li, Xuandong (Nanjing University, Nanjing, China)
CODEN: IEEPAD
Content type: Conference proceeding
DOI: 10.1109/ICSE-Companion58688.2023.00091
EISBN: 9798350322637
EISSN: 2574-1934
End page: 329
IEEE Xplore document ID: 10172879
Genre: original research
Page count: 2
Publication date: May 2023
Publication title: Proceedings (IEEE/ACM International Conference on Software Engineering Companion. Online)
Publication title abbreviation: ICSE-COMPANION
Publication year: 2023
Publisher: IEEE
Start page: 328
Subject terms: Code summarization; Codes; Deep learning; empirical study; information retrieval; Java; Performance evaluation; Semantics; Source coding; Syntactics
Title: Revisiting Information Retrieval and Deep Learning Approaches for Code Summarization
URI: https://ieeexplore.ieee.org/document/10172879