Multimodal Fusion for Thai Sign Language Recognition: Integrating RGB-Based CNN and Landmark-Based Features for Enhanced Gesture Recognition

Bibliographic Details
Published in:	2025 13th International Electrical Engineering Congress (iEECON), pp. 1-5
Main authors:	Vijitkunsawat, Wuttichai; Sopin, Anan; Sathusen, Anusorn
Format:	Conference paper
Language:	English
Published:	IEEE, 5 March 2025
Subjects:	Accuracy; Convolutional neural networks; Deep learning; deep learning models; Electrical engineering; Feature extraction; Hands; landmark-based; Libraries; multi-modal; RGB-based; Robustness; Sign language; sign language recognition; Visualization

Description
Summary:	This paper introduces a multimodal fusion model designed to improve the recognition of Thai Sign Language (TSL) gestures by combining RGB-based spatial features with landmark-based skeletal information. The proposed model employs ResNet-50, a deep Convolutional Neural Network (CNN) pre-trained on ImageNet, to extract detailed spatial features from RGB images, capturing the visual characteristics of hand gestures. In parallel, Google's MediaPipe library is used to obtain 2D hand landmarks, which represent the structure of the hand as (x, y) coordinates. The RGB and landmark-based features are then fused into a single representation that captures both the appearance and the structural details of gestures. The model was evaluated on a TSL dataset, achieving an accuracy of 94.6%, a precision of 0.937, a recall of 0.921, and an F1-score of 0.929, significantly outperforming traditional machine learning models and the standalone CNN architectures VGG-16 and ResNet-50. This study highlights the advantages of integrating spatial and skeletal features to enhance accuracy and robustness, especially in applications requiring precise recognition of complex hand gestures under varied conditions.
DOI:	10.1109/iEECON64081.2025.10987852
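
Note: the summary states that RGB features and landmark features are fused, but this record does not say how. The sketch below illustrates one common option, simple concatenation of the two feature vectors, and should not be read as the authors' method. It assumes a 2048-value pooled ResNet-50 feature vector and the 21 (x, y) hand landmarks that MediaPipe Hands reports for one hand. The function names and the wrist-relative normalization are hypothetical choices made for illustration only.

```python
from typing import List, Sequence, Tuple

NUM_LANDMARKS = 21       # MediaPipe Hands reports 21 landmarks per hand
CNN_FEATURE_DIM = 2048   # length of ResNet-50's globally pooled feature vector


def landmarks_to_vector(landmarks: Sequence[Tuple[float, float]]) -> List[float]:
    """Flatten 21 (x, y) landmarks into 42 values.

    Assumption (not from the paper): coordinates are shifted so the wrist
    (landmark 0) is the origin, then divided by the largest absolute
    coordinate so that every value lies in [-1, 1].
    """
    if len(landmarks) != NUM_LANDMARKS:
        raise ValueError(f"expected {NUM_LANDMARKS} landmarks, got {len(landmarks)}")
    wrist_x, wrist_y = landmarks[0]
    shifted = [(x - wrist_x, y - wrist_y) for x, y in landmarks]
    # Fall back to 1.0 when all points coincide, to avoid dividing by zero.
    scale = max(max(abs(x), abs(y)) for x, y in shifted) or 1.0
    return [coord / scale for point in shifted for coord in point]


def fuse_features(cnn_features: Sequence[float],
                  landmarks: Sequence[Tuple[float, float]]) -> List[float]:
    """Concatenate appearance features and landmark features (assumed fusion)."""
    if len(cnn_features) != CNN_FEATURE_DIM:
        raise ValueError(f"expected {CNN_FEATURE_DIM} CNN features, got {len(cnn_features)}")
    return list(cnn_features) + landmarks_to_vector(landmarks)


if __name__ == "__main__":
    # Placeholder inputs; real values would come from ResNet-50 and MediaPipe.
    dummy_cnn = [0.0] * CNN_FEATURE_DIM
    dummy_hand = [(0.5 + 0.01 * i, 0.5 - 0.01 * i) for i in range(NUM_LANDMARKS)]
    fused = fuse_features(dummy_cnn, dummy_hand)
    print(len(fused))  # 2048 + 42 = 2090
```

In a setup like this, the fused vector would typically feed a small classifier head. Concatenation keeps both feature sources intact and leaves their relative weighting to the classifier.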