How Reliable Are GPT-4o and LLAMA3.3-70B in Classifying Natural Language Requirements? The Impact of the Temperature Setting

Detailed Bibliography
Title: How Reliable Are GPT-4o and LLAMA3.3-70B in Classifying Natural Language Requirements? The Impact of the Temperature Setting
Authors: Karlsson, Fredrik, 1974; Chatzipetrou, Panagiota (Assistant Professor), 1984; Gao, Shang, 1982; Havstorm, Tanja Elina (Assistant Professor), 1991
Source: IEEE Software. 42(6):97-104
Subjects: Software engineering, Predictive models, Accuracy, Transformers, Training, Natural languages, Temperature measurement, Software reliability, Natural language processing, Informatics, Informatik
Description: Classifying natural language requirements (NLRs) plays a crucial role in software engineering, helping us distinguish between functional and non-functional requirements. While large language models offer automation potential, we should address concerns about their consistency, meaning their ability to produce the same results over time. In this work, we share experiences from experimenting with how well GPT-4o and LLAMA3.3-70B classify NLRs using a zero-shot learning approach. Moreover, we explore how the temperature parameter influences classification performance and consistency for these models. Our results show that large language models like GPT-4o and LLAMA3.3-70B can support automated NLR classification. GPT-4o performs well in identifying functional requirements, with the highest consistency occurring at a temperature setting of one. Additionally, non-functional requirements classification improves at higher temperatures, indicating a trade-off between determinism and adaptability. LLAMA3.3-70B is more consistent than GPT-4o, and its classification accuracy varies less depending on temperature adjustments.
File description: print
Access URL: https://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-122267
https://doi.org/10.1109/MS.2025.3572561
Database: SwePub
ISSN: 0740-7459, 1937-4194
DOI: 10.1109/MS.2025.3572561
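
The abstract describes zero-shot classification of natural language requirements while varying the model's temperature parameter. The sketch below is only an illustration of what such a setup could look like; it is not the prompt or code used in the article. It assumes the OpenAI Python SDK (v1.x), an OPENAI_API_KEY in the environment, the "gpt-4o" model identifier, and an invented prompt wording and set of temperature values.

```python
# Illustrative sketch only: not the authors' experimental setup.
# Assumes the OpenAI Python SDK (>=1.0) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical zero-shot prompt; the article's actual prompt is not reproduced here.
PROMPT = (
    "Classify the following software requirement as either 'functional' or "
    "'non-functional'. Answer with a single word.\n\nRequirement: {req}"
)

def classify_requirement(requirement: str, temperature: float = 0.0) -> str:
    """Zero-shot classification of one natural language requirement."""
    response = client.chat.completions.create(
        model="gpt-4o",            # assumed model identifier
        temperature=temperature,   # the parameter studied in the article
        messages=[{"role": "user", "content": PROMPT.format(req=requirement)}],
    )
    return response.choices[0].message.content.strip().lower()

if __name__ == "__main__":
    req = "The system shall encrypt all stored user data."
    # Repeating the call at different temperatures is one way to probe the
    # consistency trade-off the abstract describes.
    for t in (0.0, 0.5, 1.0):
        labels = [classify_requirement(req, temperature=t) for _ in range(3)]
        print(f"temperature={t}: {labels}")
```

Repeated calls at each temperature make it possible to measure both accuracy (against a labeled dataset) and consistency (agreement across repeated runs), the two properties the article compares across GPT-4o and LLAMA3.3-70B.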