Automated Testing for Service-Oriented Architecture: Leveraging Large Language Models for Enhanced Service Composition

Bibliographic Details
Published in: IEEE Access, Vol. 13, pp. 89627-89640
Main Authors: Altin, Mahsun; Mutlu, Behcet; Kilinc, Deniz; Cakir, Altan
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2025
ISSN: 2169-3536
Description
Summary: This article explores the application of Large Language Models (LLMs), including proprietary models such as OpenAI's ChatGPT 4o and ChatGPT 4o-mini, Anthropic's Claude 3.5 Sonnet and Claude 3.7 Sonnet, and Google's Gemini 1.5 Pro, Gemini 2.0 Flash, and Gemini 2.0 Flash-Lite, as well as open-source alternatives such as Qwen2.5-14B-Instruct-1M, and commercially accessed models such as DeepSeek R1 and DeepSeek V3 (tested via APIs despite having open-source variants), to automate validation and verification in Application Programming Interface (API) testing within a Service-Oriented Architecture (SOA). Our system compares internal responses from the Enuygun Web Server against third-party API outputs in both JSON and XML formats, validating critical parameters such as flight prices, baggage allowances, and seat availability. We generated 100 diverse test scenarios of varying complexity (1-4 flight results) by randomly altering request and response parameters. Experimental results show that Google Gemini 2.0 Flash achieved high accuracy (up to 99.98%) with the lowest completion time (85.34 seconds), while Qwen2.5-14B-Instruct-1M exhibited limited capability in processing complex formats. OpenAI's ChatGPT models and Anthropic's Claude Sonnet models also demonstrated strong performance in single-flight validation scenarios, making them suitable for low-latency, high-precision tasks. Our findings indicate that some open-source models can offer promising cost-effective alternatives, though performance varies significantly. This integration of LLMs reduced manual workload, improved test scalability, and enabled real-time validation across large-scale datasets. As LLM technologies mature, we anticipate further advances in automation, accuracy, and efficiency in software validation systems.
DOI: 10.1109/ACCESS.2025.3571994
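
Illustrative note: the workflow described in the abstract (comparing an internal flight result from the Enuygun Web Server against a third-party API response and checking price, baggage allowance, and seat availability, with an LLM issuing the pass/fail verdict) can be sketched as below. The field names, XML schema, prompt wording, and function names are assumptions made for illustration only; the article does not publish its validation code or data schemas.

# Minimal Python sketch, not the authors' implementation. It builds an LLM
# validation prompt over one internal JSON response and one third-party XML
# response, plus a deterministic cross-check on the three critical fields.
import json
import xml.etree.ElementTree as ET


def parse_third_party_xml(xml_text: str) -> dict:
    """Extract the fields under test from a (hypothetical) third-party XML flight response."""
    root = ET.fromstring(xml_text)
    return {
        "price": float(root.findtext("price")),
        "baggage_kg": int(root.findtext("baggage_kg")),
        "seats_available": int(root.findtext("seats_available")),
    }


def build_validation_prompt(internal_json: str, third_party_xml: str) -> str:
    """Compose the text that would be sent to an LLM for field-by-field validation."""
    return (
        "Compare the two flight responses below. For price, baggage allowance, "
        "and seat availability, state whether the values match. Answer PASS or "
        "FAIL with a one-line reason.\n\n"
        f"Internal response (JSON):\n{internal_json}\n\n"
        f"Third-party response (XML):\n{third_party_xml}\n"
    )


def fields_match(internal_json: str, third_party_xml: str) -> bool:
    """Deterministic cross-check that can run alongside the LLM verdict."""
    internal = json.loads(internal_json)
    external = parse_third_party_xml(third_party_xml)
    return (
        abs(internal["price"] - external["price"]) < 0.01
        and internal["baggage_kg"] == external["baggage_kg"]
        and internal["seats_available"] == external["seats_available"]
    )


if __name__ == "__main__":
    # Hypothetical single-flight scenario with matching responses.
    internal = json.dumps({"price": 129.99, "baggage_kg": 20, "seats_available": 4})
    external = (
        "<flight><price>129.99</price><baggage_kg>20</baggage_kg>"
        "<seats_available>4</seats_available></flight>"
    )
    print(build_validation_prompt(internal, external))
    print("fields_match:", fields_match(internal, external))

In the study, prompts of this kind were issued to the evaluated models (Gemini, ChatGPT, Claude, Qwen, DeepSeek) across 100 scenarios with one to four flight results each; the sketch above only shows the shape of a single comparison.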