Generating reliable software project task flows using large language models through prompt engineering and robust evaluation.

Bibliographic Details
Title: Generating reliable software project task flows using large language models through prompt engineering and robust evaluation.
Authors: Sarim, Mohammed (AUTHOR); Masood, Faraz (AUTHOR); Maheshwari, Manas (AUTHOR); Faridi, Arman Rasool (AUTHOR); Shamsan, Ali Haider (AUTHOR), a.h.shamsan@ust.edu
Source: Scientific Reports. 10/8/2025, Vol. 15 Issue 1, p1-25. 25p.
Subject Terms: *LANGUAGE models, *PROJECT management software, *EVALUATION methodology
Abstract: Large Language Models (LLMs) offer promising capabilities for converting unstructured software documentation into structured task flows, yet their outputs often lack the procedural reliability critical for software engineering. This paper presents a comprehensive framework that benchmarks five leading LLMs (Gemini 2.5 Pro, Grok 3, GPT-Omni, DeepSeek-R1, and LLaMA-3) across five prompting strategies, including Zero-Shot, Chain-of-Thought, and ISO 21502-Guided, using real-world software tutorials from the "Build Your Own X" repository. We introduce the Hybrid Semantic Similarity Metric (HSSM), which combines SentenceTransformer embeddings with context-aware key-term overlap, capturing both semantic fidelity and procedural coherence. Compared to traditional metrics such as BERTScore, SBERT, and USE, HSSM demonstrates significantly lower variance (CV: 1.5–2.9%) and stronger correlation with human judgments. Our results show that even minimal prompting (Zero-Shot) can yield highly aligned task flows (HSSM: 96.33%) when evaluated with robust metrics. This work offers a scalable evaluation paradigm for LLM-assisted software planning, with implications for AI-driven project management, prompt engineering, and procedural generation in software education and tooling. [ABSTRACT FROM AUTHOR]
Database: Academic Search Index
ISSN: 2045-2322
DOI: 10.1038/s41598-025-19170-9
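
The abstract describes HSSM as blending SentenceTransformer embeddings with context-aware key-term overlap. The Python sketch below is only a rough illustration of that general idea, not the authors' implementation: the encoder name, the alpha weighting, and the crude key-term heuristic are all assumptions introduced here for illustration.

# Hypothetical sketch of a hybrid similarity score in the spirit of HSSM:
# embedding cosine similarity blended with key-term overlap. The model name,
# weighting, and key-term extraction are assumptions, not the paper's method.
import re
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def key_terms(text: str) -> set[str]:
    """Crude key-term proxy: lowercase alphanumeric tokens of length >= 4."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) >= 4}

def hybrid_similarity(reference: str, candidate: str, alpha: float = 0.7) -> float:
    """Blend embedding cosine similarity with key-term Jaccard overlap."""
    emb = _model.encode([reference, candidate], convert_to_tensor=True)
    semantic = float(util.cos_sim(emb[0], emb[1]))  # semantic fidelity
    ref_terms, cand_terms = key_terms(reference), key_terms(candidate)
    union = ref_terms | cand_terms
    overlap = len(ref_terms & cand_terms) / len(union) if union else 0.0  # key-term overlap
    return alpha * semantic + (1 - alpha) * overlap

if __name__ == "__main__":
    ref = "Initialize the git repository, then configure the remote origin."
    cand = "Create a git repo and set the remote origin URL."
    print(f"hybrid similarity: {hybrid_similarity(ref, cand):.3f}")

In this toy form, the score degrades when a generated task flow is topically similar but omits the reference's key procedural terms, which is the kind of procedural drift the abstract says embedding-only metrics can miss.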