Generating reliable software project task flows using large language models through prompt engineering and robust evaluation.
| Title: | Generating reliable software project task flows using large language models through prompt engineering and robust evaluation. |
|---|---|
| Authors: | Sarim, Mohammed; Masood, Faraz; Maheshwari, Manas; Faridi, Arman Rasool; Shamsan, Ali Haider (a.h.shamsan@ust.edu) |
| Source: | Scientific Reports. 10/8/2025, Vol. 15 Issue 1, p1-25. 25p. |
| Subject Terms: | Language models; Project management software; Evaluation methodology |
| Abstract: | Large Language Models (LLMs) offer promising capabilities for converting unstructured software documentation into structured task flows, yet their outputs often lack the procedural reliability critical for software engineering. This paper presents a comprehensive framework that benchmarks five leading LLMs (Gemini 2.5 Pro, Grok 3, GPT-Omni, DeepSeek-R1, and LLaMA-3) across five prompting strategies, including Zero-Shot, Chain-of-Thought, and ISO 21502-Guided, using real-world software tutorials from the "Build Your Own X" repository. We introduce the Hybrid Semantic Similarity Metric (HSSM), which combines SentenceTransformer embeddings with context-aware key-term overlap, capturing both semantic fidelity and procedural coherence. Compared to traditional metrics such as BERTScore, SBERT, and USE, HSSM demonstrates significantly lower variance (CV: 1.5–2.9%) and stronger correlation with human judgments. Our results show that even minimal prompting (Zero-Shot) can yield highly aligned task flows (HSSM: 96.33%) when evaluated with robust metrics. This work offers a scalable evaluation paradigm for LLM-assisted software planning, with implications for AI-driven project management, prompt engineering, and procedural generation in software education and tooling. [ABSTRACT FROM AUTHOR] |
| Database: | Academic Search Index |
| ISSN: | 2045-2322 |
| DOI: | 10.1038/s41598-025-19170-9 |
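The abstract describes HSSM as a blend of SentenceTransformer embedding similarity and context-aware key-term overlap. The record does not give the paper's actual weighting or key-term extraction, so the following is only a minimal sketch of such a hybrid score, assuming a generic embedding model (`all-MiniLM-L6-v2`), a naive stop-word-filtered keyword set, and a hypothetical 0.7/0.3 weighting.

```python
# Minimal sketch of a hybrid semantic-similarity score in the spirit of the
# HSSM described in the abstract. The model name, key-term extraction, and
# the 0.7/0.3 weighting are assumptions, not the paper's actual settings.
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def key_term_overlap(reference: str, candidate: str) -> float:
    """Jaccard overlap of lower-cased content words (a simple stand-in for
    the paper's context-aware key-term extraction)."""
    stop = {"the", "a", "an", "and", "or", "to", "of", "in", "for", "with"}
    ref_terms = {w for w in reference.lower().split() if w not in stop}
    cand_terms = {w for w in candidate.lower().split() if w not in stop}
    if not ref_terms or not cand_terms:
        return 0.0
    return len(ref_terms & cand_terms) / len(ref_terms | cand_terms)

def hybrid_similarity(reference: str, candidate: str,
                      semantic_weight: float = 0.7) -> float:
    """Weighted blend of embedding cosine similarity and key-term overlap."""
    emb = _model.encode([reference, candidate], convert_to_tensor=True)
    semantic = float(util.cos_sim(emb[0], emb[1]))
    overlap = key_term_overlap(reference, candidate)
    return semantic_weight * semantic + (1.0 - semantic_weight) * overlap

# Example: compare a reference task-flow step with a generated one.
print(hybrid_similarity("Initialize the git repository and add a README",
                        "Create a new git repo and include a README file"))
```

In a full evaluation pipeline, a score like this would be computed per task-flow step and aggregated across a tutorial, which is where a low coefficient of variation such as the 1.5–2.9% reported in the abstract would be measured.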