How Reliable Are GPT-4o and LLAMA3.3-70B in Classifying Natural Language Requirements? The Impact of the Temperature Setting
Saved in:
| Title: | How Reliable Are GPT-4o and LLAMA3.3-70B in Classifying Natural Language Requirements? The Impact of the Temperature Setting |
|---|---|
| Authors: | Karlsson, Fredrik (1974); Chatzipetrou, Panagiota, Assistant Professor (1984); Gao, Shang (1982); Havstorm, Tanja Elina, Assistant Professor (1991) |
| Source: | IEEE Software. 42(6):97-104 |
| Subjects: | Software engineering, Predictive models, Accuracy, Transformers, Training, Natural languages, Temperature measurement, Software reliability, Natural language processing, Informatics |
| Description: | Classifying natural language requirements (NLRs) plays a crucial role in software engineering, helping us distinguish between functional and non-functional requirements. While large language models offer automation potential, we should address concerns about their consistency, meaning their ability to produce the same results over time. In this work, we share experiences from experimenting with how well GPT-4o and LLAMA3.3-70B classify NLRs using a zero-shot learning approach. Moreover, we explore how the temperature parameter influences classification performance and consistency for these models. Our results show that large language models like GPT-4o and LLAMA3.3-70B can support automated NLR classification. GPT-4o performs well in identifying functional requirements, with the highest consistency occurring at a temperature setting of one. Additionally, non-functional requirements classification improves at higher temperatures, indicating a trade-off between determinism and adaptability. LLAMA3.3-70B is more consistent than GPT-4o, and its classification accuracy varies less depending on temperature adjustments. |
| Access URL: | https://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-122267 https://doi.org/10.1109/MS.2025.3572561 |
| Database: | SwePub |
| ISSN: | 0740-7459 1937-4194 |
| DOI: | 10.1109/MS.2025.3572561 |
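The abstract's notion of consistency — a model producing the same label for the same requirement across repeated runs at a fixed temperature — can be sketched as a simple agreement score. This is a minimal illustration; the function name and "F"/"NF" labels are assumptions, not code from the paper:

```python
from collections import Counter

def consistency(labels):
    """Share of repeated runs that agree with the most frequent label.

    labels: classification results ("F" = functional, "NF" = non-functional)
    collected by querying the model repeatedly with the same requirement.
    """
    if not labels:
        raise ValueError("need at least one run")
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / len(labels)

# Five repeated classifications of one requirement: 4 of 5 runs agree.
print(consistency(["F", "F", "NF", "F", "F"]))  # 0.8
```

A score of 1.0 means fully deterministic behavior across runs; lower values quantify the run-to-run variability that the study examines under different temperature settings.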