A syntax-guided multi-task learning approach for Turducken-style code generation

Bibliographic Details
Published in: Empirical Software Engineering: An International Journal, Vol. 28, Issue 6, p. 141
Authors: Yang, Guang; Zhou, Yu; Chen, Xiang; Zhang, Xiangyu; Xu, Yiran; Han, Tingting; Chen, Taolue
Format: Journal Article
Language: English
Published: New York: Springer US / Springer Nature B.V., 01.11.2023
ISSN: 1382-3256, 1573-7616
Online access: Full text
Description
Abstract: Due to the development of pre-trained language models, automated code generation techniques have shown great promise in recent years. However, the generated code does not always adhere to the syntactic constraints of the target language, especially in the case of Turducken-style code, where declarative code snippets are embedded within imperative programs. In this study, we summarize three significant challenges with regard to syntactic constraints: (1) the efficient representation of syntactic constraints, (2) the effective integration of syntactic information, and (3) the scalable syntax-first decoding algorithm. To address these challenges, we propose a syntax-guided multi-task learning approach, TurduckenGen. Specifically, we first explicitly append type information to the code tokens to capture the representation of syntactic constraints. Then we formalize code generation with syntactic constraint representation as an auxiliary task, enabling the model to learn the syntactic constraints of the code. Finally, the syntactically correct code is accurately selected from multiple candidates with the help of compiler feedback. Extensive experiments and comprehensive analysis, comparing our approach with six state-of-the-art baselines on two Turducken-style code datasets, demonstrate its effectiveness and general applicability. In addition, a human study found that the code generated by our approach is of higher quality than that of the baselines in terms of readability and semantic similarity.
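
As a rough, hypothetical illustration of the compiler-feedback step mentioned in the abstract, the Python sketch below assumes that several candidates have already been generated (e.g., via beam search) and simply keeps the first one that the host-language parser accepts. The helper names (select_candidate, query) and the use of Python's ast module are assumptions for illustration only and are not taken from TurduckenGen's implementation.

import ast
from typing import Iterable, Optional

def is_syntactically_valid(code: str) -> bool:
    # Check host-language (Python) syntax via the built-in parser.
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def select_candidate(candidates: Iterable[str]) -> Optional[str]:
    # Keep the first candidate that parses; fall back to the first one if none do.
    candidates = list(candidates)
    for code in candidates:
        if is_syntactically_valid(code):
            return code
    return candidates[0] if candidates else None

# Usage: 'query' is a hypothetical wrapper around an embedded SQL string.
beams = [
    "df = query('SELECT name FROM users WHERE age > 30'",   # unbalanced parenthesis: rejected
    "df = query('SELECT name FROM users WHERE age > 30')",  # parses: selected
]
print(select_candidate(beams))

A fuller check would also validate the embedded declarative fragment (e.g., the SQL string) with its own parser; that part is omitted here for brevity.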
DOI: 10.1007/s10664-023-10372-1