Cell2Doc: ML Pipeline for Generating Documentation in Computational Notebooks

Bibliographic details
Published in:	IEEE/ACM International Conference on Automated Software Engineering: [proceedings], pp. 384-396
Main authors:	Mondal, Tamal; Barnett, Scott; Lal, Akash; Vedurada, Jyothi
Format:	Conference proceeding
Language:	English
Published:	IEEE, 11 September 2023
Subjects:	AI4SW; code documentation; Codes; Computational modeling; Data science; developer productivity; Documentation; Machine learning; Manuals; Notebooks documentation; Pipelines; software maintenance
ISSN:	2643-1572
Online access:	Full text

Description
Abstract:	Computational notebooks have become the go-to way for solving data-science problems. While they are designed to combine code and documentation, prior work shows that documentation is largely ignored by the developers because of the manual effort. Automated documentation generation can help, but existing techniques fail to capture algorithmic details and developers often end up editing the generated text to provide more explanation and sub-steps. This paper proposes a novel machine-learning pipeline, Cell2Doc, for code cell documentation in Python data science notebooks. Our approach works by identifying different logical contexts within a code cell, generating documentation for them separately, and finally combining them to arrive at the documentation for the entire code cell. Cell2Doc takes advantage of the capabilities of existing pre-trained language models and improves their efficiency for code cell documentation. We also provide a new benchmark dataset for this task, along with a data-preprocessing pipeline that can be used to create new datasets. We also investigate an appropriate input representation for this task. Our automated evaluation suggests that our best input representation improves the pre-trained model's performance by 2.5x on average. Further, Cell2Doc achieves 1.33x improvement during human evaluation in terms of correctness, informativeness, and readability against the corresponding standalone pre-trained model.
DOI:	10.1109/ASE56229.2023.00200
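
Illustrative sketch:	The abstract describes a three-stage flow: identify logical contexts within a code cell, document each one separately, then combine the pieces into documentation for the whole cell. The Python sketch below only illustrates that flow and is not the paper's implementation. Everything in it is an assumption: the function names are hypothetical, the segmentation rule (one context per top-level statement) is a naive stand-in for whatever Cell2Doc actually does, and `placeholder_describe` replaces the pre-trained language model with a trivial rule so the example runs without any model.

```python
import ast
from typing import Callable, List


def split_into_contexts(cell_source: str) -> List[str]:
    """Hypothetical segmentation: treat each top-level statement as one context."""
    tree = ast.parse(cell_source)  # parses only; the cell is never executed
    lines = cell_source.splitlines()
    contexts = []
    for node in tree.body:
        # lineno and end_lineno are 1-based and inclusive (Python 3.8+).
        contexts.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return contexts


def placeholder_describe(snippet: str) -> str:
    """Stand-in for a pre-trained code-to-text model (assumption, not a model call)."""
    node = ast.parse(snippet).body[0]
    first_line = snippet.splitlines()[0].strip()
    return f"{type(node).__name__} statement starting at: {first_line}"


def document_cell(
    cell_source: str,
    describe: Callable[[str], str] = placeholder_describe,
) -> str:
    """Document each context separately, then combine into numbered sub-steps."""
    parts = [describe(context) for context in split_into_contexts(cell_source)]
    return "\n".join(f"{i}. {part}" for i, part in enumerate(parts, start=1))


# A small example cell, given as a string so nothing here needs pandas installed.
example_cell = (
    "import pandas as pd\n"
    "df = pd.read_csv('data.csv')\n"
    "df = df.dropna()\n"
)
```

Calling `document_cell(example_cell)` yields one numbered line per context. Passing a different `describe` callable is where a real code-to-text model would plug in under this sketch's assumptions.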