Generating reliable software project task flows using large language models through prompt engineering and robust evaluation.

Bibliographic Details
Title: Generating reliable software project task flows using large language models through prompt engineering and robust evaluation.
Authors: Sarim, Mohammed (AUTHOR); Masood, Faraz (AUTHOR); Maheshwari, Manas (AUTHOR); Faridi, Arman Rasool (AUTHOR); Shamsan, Ali Haider (AUTHOR), a.h.shamsan@ust.edu
Source: Scientific Reports. 10/8/2025, Vol. 15 Issue 1, p1-25. 25p.
Subject Terms: *LANGUAGE models, *PROJECT management software, *EVALUATION methodology
Abstract: Large Language Models (LLMs) offer promising capabilities for converting unstructured software documentation into structured task flows, yet their outputs often lack the procedural reliability critical for software engineering. This paper presents a comprehensive framework that benchmarks five leading LLMs (Gemini 2.5 Pro, Grok 3, GPT-Omni, DeepSeek-R1, and LLaMA-3) across five prompting strategies, including Zero-Shot, Chain-of-Thought, and ISO 21502-Guided, using real-world software tutorials from the "Build Your Own X" repository. We introduce the Hybrid Semantic Similarity Metric (HSSM), which combines SentenceTransformer embeddings with context-aware key-term overlap, capturing both semantic fidelity and procedural coherence. Compared to traditional metrics such as BERTScore, SBERT, and USE, HSSM demonstrates significantly lower variance (CV: 1.5–2.9%) and stronger correlation with human judgments. Our results show that even minimal prompting (Zero-Shot) can yield highly aligned task flows (HSSM: 96.33%) when evaluated with robust metrics. This work offers a scalable evaluation paradigm for LLM-assisted software planning, with implications for AI-driven project management, prompt engineering, and procedural generation in software education and tooling. [ABSTRACT FROM AUTHOR]
Database: Academic Search Index
ISSN: 2045-2322
DOI: 10.1038/s41598-025-19170-9
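
The abstract describes HSSM as blending SentenceTransformer embeddings with context-aware key-term overlap. The Python sketch below is only a rough illustration of that general idea, not the authors' implementation: the encoder name, the alpha weighting, and the crude key-term heuristic are all assumptions introduced here for illustration.

# Hypothetical sketch of a hybrid similarity score in the spirit of HSSM:
# embedding cosine similarity blended with key-term overlap. The model name,
# weighting, and key-term extraction are assumptions, not the paper's method.
import re
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def key_terms(text: str) -> set[str]:
    """Crude key-term proxy: lowercase alphanumeric tokens of length >= 4."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) >= 4}

def hybrid_similarity(reference: str, candidate: str, alpha: float = 0.7) -> float:
    """Blend embedding cosine similarity with key-term Jaccard overlap."""
    emb = _model.encode([reference, candidate], convert_to_tensor=True)
    semantic = float(util.cos_sim(emb[0], emb[1]))  # semantic fidelity
    ref_terms, cand_terms = key_terms(reference), key_terms(candidate)
    union = ref_terms | cand_terms
    overlap = len(ref_terms & cand_terms) / len(union) if union else 0.0  # key-term overlap
    return alpha * semantic + (1 - alpha) * overlap

if __name__ == "__main__":
    ref = "Initialize the git repository, then configure the remote origin."
    cand = "Create a git repo and set the remote origin URL."
    print(f"hybrid similarity: {hybrid_similarity(ref, cand):.3f}")

In this toy form, the score degrades when a generated task flow is topically similar but omits the reference's key procedural terms, which is the kind of procedural drift the abstract says embedding-only metrics can miss.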