Can an LLM Find Its Way Around a Spreadsheet?

Bibliographic details
Published in:	Proceedings / International Conference on Software Engineering, pp. 294-306
Main authors:	Lee, Cho-Ting; Neeser, Andrew; Xu, Shengzhe; Katyan, Jay; Cross, Patrick; Pathakota, Sharanya; Norman, Marigold; Simeone, John; Chandrasekaran, Jaganmohan; Ramakrishnan, Naren
Format:	Conference paper
Language:	English
Published:	IEEE, 26 April 2025
Subjects:	Business; Cleaning; code generation; Codes; data cleaning; Data preprocessing; end-user programming; Libraries; LLMs; Pipelines; Programming; Software engineering; Standardization; Vocabulary
ISSN:	1558-1225

Description
Summary:	Spreadsheets are routinely used in business and scientific contexts, and one of the most vexing challenges is performing data cleaning prior to analysis and evaluation. The ad-hoc and arbitrary nature of data cleaning problems, such as typos, inconsistent formatting, missing values, and a lack of standardization, often creates the need for highly specialized pipelines. We ask whether an LLM can find its way around a spreadsheet and how to support end-users in taking their free-form data processing requests to fruition. Just like RAG retrieves context to answer users' queries, we demonstrate how we can retrieve elements from a code library to compose data preprocessing pipelines. Through comprehensive experiments, we demonstrate the quality of our system and how it is able to continuously augment its vocabulary by saving new codes and pipelines back to the code library for future retrieval.
DOI:	10.1109/ICSE55347.2025.00101
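
Illustrative note: the summary describes retrieving elements from a code library to compose data preprocessing pipelines, then saving new pipelines back for later retrieval. The sketch below is only a toy rendering of that idea and is not the authors' system. Every name in it (`CodeLibrary`, `retrieve`, `compose`, the three cleaning steps) is hypothetical, and plain word overlap stands in for whatever retrieval method the paper actually uses.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Step = Callable[[List[Row]], List[Row]]


# Three hypothetical cleaning steps that could live in a code library.
def strip_whitespace(rows: List[Row]) -> List[Row]:
    return [{k: v.strip() for k, v in r.items()} for r in rows]


def lowercase_values(rows: List[Row]) -> List[Row]:
    return [{k: v.lower() for k, v in r.items()} for r in rows]


def drop_rows_with_missing(rows: List[Row]) -> List[Row]:
    return [r for r in rows if all(v != "" for v in r.values())]


class CodeLibrary:
    """Toy code library: maps a text description to a cleaning step."""

    def __init__(self) -> None:
        self._entries: Dict[str, Step] = {}

    def save(self, description: str, step: Step) -> None:
        self._entries[description] = step

    def retrieve(self, request: str, top_k: int = 2) -> List[Step]:
        # Assumption: word overlap is a stand-in for a real retriever.
        words = set(request.lower().split())
        scored = []
        for description, step in self._entries.items():
            overlap = len(words & set(description.lower().split()))
            if overlap > 0:
                scored.append((overlap, step))
        # Sort on the score only; ties keep insertion order (stable sort).
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [step for _, step in scored[:top_k]]


def compose(steps: List[Step]) -> Step:
    """Chain the retrieved steps into a single pipeline function."""

    def pipeline(rows: List[Row]) -> List[Row]:
        for step in steps:
            rows = step(rows)
        return rows

    return pipeline


library = CodeLibrary()
library.save("strip whitespace from cells", strip_whitespace)
library.save("drop rows with missing values", drop_rows_with_missing)
library.save("lowercase text values", lowercase_values)

request = "strip whitespace and drop rows with missing values"
pipeline = compose(library.retrieve(request, top_k=2))

rows = [
    {"name": " Alice ", "city": "Prague"},
    {"name": "Bob", "city": ""},
]
cleaned = pipeline(rows)
print(cleaned)

# Save the composed pipeline back so a later request can retrieve it whole.
library.save("clean customer sheet", pipeline)
```

In this toy version the retrieved steps run in score order, which is a simplification: a real system would also need to decide the order in which steps are applied.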