Cell2Doc: ML Pipeline for Generating Documentation in Computational Notebooks

Bibliographic Details
Published in:	IEEE/ACM International Conference on Automated Software Engineering : [proceedings], pp. 384-396
Main Authors:	Mondal, Tamal; Barnett, Scott; Lal, Akash; Vedurada, Jyothi
Format:	Conference paper
Language:	English
Published:	IEEE, 11 September 2023
Subjects:	AI4SW; code documentation; Codes; Computational modeling; Data science; developer productivity; Documentation; Machine learning; Manuals; Notebooks documentation; Pipelines; software maintenance
ISSN:	2643-1572

Description
Summary:	Computational notebooks have become the go-to way for solving data-science problems. While they are designed to combine code and documentation, prior work shows that documentation is largely ignored by developers because of the manual effort it requires. Automated documentation generation can help, but existing techniques fail to capture algorithmic details, and developers often end up editing the generated text to provide more explanation and sub-steps. This paper proposes a novel machine-learning pipeline, Cell2Doc, for code cell documentation in Python data science notebooks. Our approach works by identifying different logical contexts within a code cell, generating documentation for them separately, and finally combining them to arrive at the documentation for the entire code cell. Cell2Doc takes advantage of the capabilities of existing pre-trained language models and improves their efficiency for code cell documentation. We also provide a new benchmark dataset for this task, along with a data-preprocessing pipeline that can be used to create new datasets, and we investigate an appropriate input representation for this task. Our automated evaluation suggests that our best input representation improves the pre-trained model's performance by 2.5x on average. Further, Cell2Doc achieves a 1.33x improvement during human evaluation in terms of correctness, informativeness, and readability against the corresponding standalone pre-trained model.
DOI:	10.1109/ASE56229.2023.00200
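
The summary describes a three-stage flow: split a code cell into logical contexts, document each context separately, then combine the results. The following is a minimal sketch of that flow only, not the Cell2Doc implementation. It rests on assumptions that do not come from the paper: blank lines between top-level statements stand in for context boundaries, and the hypothetical `describe` callable is a stand-in for whatever pre-trained model generates the text. The function names `split_into_contexts` and `document_cell` are likewise illustrative.

```python
import ast
from typing import Callable, List


def split_into_contexts(cell_source: str) -> List[str]:
    """Group top-level statements into contexts.

    Assumption (not from the paper): a new context starts wherever at
    least one line separates two consecutive top-level statements.
    """
    tree = ast.parse(cell_source)
    lines = cell_source.splitlines()
    contexts: List[str] = []
    current: List[str] = []
    prev_end = None
    for node in tree.body:
        # A gap larger than one line number means something (here, a
        # blank line) sits between the previous statement and this one.
        if prev_end is not None and node.lineno - prev_end > 1 and current:
            contexts.append("\n".join(current))
            current = []
        # Slice the original source so formatting is preserved.
        current.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
        prev_end = node.end_lineno
    if current:
        contexts.append("\n".join(current))
    return contexts


def document_cell(cell_source: str, describe: Callable[[str], str]) -> str:
    """Document each context separately, then combine into one text.

    `describe` is a hypothetical stand-in for a documentation model.
    """
    contexts = split_into_contexts(cell_source)
    return "\n".join(f"- {describe(context)}" for context in contexts)


def placeholder_describe(snippet: str) -> str:
    """Trivial stand-in that only reports the size of a context."""
    count = len(ast.parse(snippet).body)
    return f"Context with {count} top-level statement(s)."


example_cell = (
    "import pandas as pd\n"
    "df = pd.read_csv('data.csv')\n"
    "\n"
    "df = df.dropna()\n"
    "\n"
    "print(df.head())\n"
)

if __name__ == "__main__":
    print(document_cell(example_cell, placeholder_describe))
```

In a real setting the `describe` argument would wrap a call to a code-summarization model, and the splitting heuristic would be replaced by whatever context-identification step the pipeline actually uses.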