How Reliable Are GPT-4o and LLAMA3.3-70B in Classifying Natural Language Requirements? The Impact of the Temperature Setting

Detailed Bibliography
Title: How Reliable Are GPT-4o and LLAMA3.3-70B in Classifying Natural Language Requirements? The Impact of the Temperature Setting
Authors: Karlsson, Fredrik, 1974; Chatzipetrou, Panagiota (Assistant Professor), 1984; Gao, Shang, 1982; Havstorm, Tanja Elina (Assistant Professor), 1991
Source: IEEE Software. 42(6):97-104
Subjects: Software engineering, Predictive models, Accuracy, Transformers, Training, Natural languages, Temperature measurement, Software reliability, Natural language processing, Informatics, Informatik
Description: Classifying natural language requirements (NLRs) plays a crucial role in software engineering, helping us distinguish between functional and non-functional requirements. While large language models offer automation potential, we should address concerns about their consistency, meaning their ability to produce the same results over time. In this work, we share experiences from experimenting with how well GPT-4o and LLAMA3.3-70B classify NLRs using a zero-shot learning approach. Moreover, we explore how the temperature parameter influences classification performance and consistency for these models. Our results show that large language models like GPT-4o and LLAMA3.3-70B can support automated NLR classification. GPT-4o performs well in identifying functional requirements, with the highest consistency occurring at a temperature setting of one. Additionally, non-functional requirements classification improves at higher temperatures, indicating a trade-off between determinism and adaptability. LLAMA3.3-70B is more consistent than GPT-4o, and its classification accuracy varies less depending on temperature adjustments.
File description: print
Access URL: https://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-122267
https://doi.org/10.1109/MS.2025.3572561
Database: SwePub
ISSN: 0740-7459, 1937-4194
DOI: 10.1109/MS.2025.3572561
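
The abstract describes zero-shot classification of natural language requirements while varying the model's temperature parameter. The sketch below is only an illustration of what such a setup could look like; it is not the prompt or code used in the article. It assumes the OpenAI Python SDK (v1.x), an OPENAI_API_KEY in the environment, the "gpt-4o" model identifier, and an invented prompt wording and set of temperature values.

```python
# Illustrative sketch only: not the authors' experimental setup.
# Assumes the OpenAI Python SDK (>=1.0) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical zero-shot prompt; the article's actual prompt is not reproduced here.
PROMPT = (
    "Classify the following software requirement as either 'functional' or "
    "'non-functional'. Answer with a single word.\n\nRequirement: {req}"
)

def classify_requirement(requirement: str, temperature: float = 0.0) -> str:
    """Zero-shot classification of one natural language requirement."""
    response = client.chat.completions.create(
        model="gpt-4o",            # assumed model identifier
        temperature=temperature,   # the parameter studied in the article
        messages=[{"role": "user", "content": PROMPT.format(req=requirement)}],
    )
    return response.choices[0].message.content.strip().lower()

if __name__ == "__main__":
    req = "The system shall encrypt all stored user data."
    # Repeating the call at different temperatures is one way to probe the
    # consistency trade-off the abstract describes.
    for t in (0.0, 0.5, 1.0):
        labels = [classify_requirement(req, temperature=t) for _ in range(3)]
        print(f"temperature={t}: {labels}")
```

Repeated calls at each temperature make it possible to measure both accuracy (against a labeled dataset) and consistency (agreement across repeated runs), the two properties the article compares across GPT-4o and LLAMA3.3-70B.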