A novel human action recognition using Grad-CAM visualization with gated recurrent units

Detailed bibliography
Published in: Neural Computing & Applications, Vol. 37, No. 17, pp. 10835-10850
Main authors: Jayamohan, M., Yuvaraj, S.
Medium: Journal Article
Language: English
Publication details: London: Springer London, 01.06.2025 (Springer Nature B.V.)
ISSN: 0941-0643, 1433-3058
Description
Summary: Human action recognition is a vital aspect of computer vision, with applications ranging from security systems to interactive technology. Our study presents a comprehensive methodology that employs multiple feature extraction and optimization techniques to enhance the accuracy and efficiency of human action identification. The video input was divided into four distinct elements: RGB images, optical flow information, spatial saliency maps, and temporal saliency maps. Each component was analyzed independently using dedicated computer vision algorithms: the Farneback algorithm was employed to compute the optical flow, Canny edge detection was used to assess spatial prominence, and frame comparison was used to identify motion-based prominence. Together, these processed elements provide a comprehensive representation of both the spatial and temporal information in the video. The extracted data were then input into distinct pretrained deep learning models: Inception-V3 for the RGB frames and optical flow, ResNetV2 for the spatial saliency maps, and DenseNet-121 for the motion saliency maps. Each network processes its modality separately and extracts features suited to it, ensuring the comprehensive capture of both static and dynamic elements of the video data. Subsequently, sequence modeling and classification were performed using a gated recurrent unit (GRU) with an attention mechanism that dynamically highlights the most significant temporal segments, improving the model's capacity to comprehend intricate human actions within video sequences. To enhance the efficiency of the model, we applied the Grasshopper optimization algorithm to the feature selection and classification stages, maximizing the use of the extracted features. We evaluated our approach on two standard datasets, UCF101 and HMDB51; the method achieved 98.35% accuracy on UCF101 and 83.45% on HMDB51. Grad-CAM visualization reveals the key regions on which the model focuses for action recognition. This study underscores the effectiveness of integrating multimodal feature extraction, deep learning, and optimization for precise and interpretable human action recognition. The proposed method performs well across diverse, complex datasets, offering a practical solution for real-world applications such as automated surveillance, human–computer interfaces, and activity monitoring platforms.
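
As an illustration of the four-stream decomposition the abstract describes, the following is a minimal OpenCV sketch of the named techniques (Farneback optical flow, Canny edge detection, and frame differencing). It is not the authors' code; the thresholds and flow parameters are illustrative assumptions.

```python
import cv2

def decompose_frame_pair(prev_bgr, curr_bgr):
    """Split two consecutive video frames into the four streams from the
    abstract: RGB, optical flow, spatial saliency, temporal saliency.
    All parameter values here are illustrative, not the paper's."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)

    # Dense optical flow via the Farneback algorithm.
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

    # Spatial saliency proxy: Canny edge map of the current frame.
    spatial_saliency = cv2.Canny(curr_gray, 100, 200)

    # Temporal (motion) saliency proxy: absolute frame difference.
    temporal_saliency = cv2.absdiff(curr_gray, prev_gray)

    return curr_bgr, flow, spatial_saliency, temporal_saliency
```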
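The sequence model pairs a GRU with temporal attention. Below is a minimal PyTorch sketch of that general idea, assuming per-frame feature vectors from the CNN backbones; the layer sizes and the simple linear scoring function are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AttentiveGRUClassifier(nn.Module):
    """GRU over per-frame feature vectors, with soft attention that
    weights the most informative time steps before classification."""
    def __init__(self, feat_dim=2048, hidden_dim=256, num_classes=101):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)   # attention scoring
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                       # x: (batch, time, feat_dim)
        h, _ = self.gru(x)                      # h: (batch, time, hidden_dim)
        weights = torch.softmax(self.score(h), dim=1)  # over the time axis
        context = (weights * h).sum(dim=1)      # attention-weighted pooling
        return self.head(context)               # class logits

# Usage: a batch of 4 clips, 16 frames each, 2048-d fused features per
# frame (shapes are assumptions for illustration).
logits = AttentiveGRUClassifier()(torch.randn(4, 16, 2048))
```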
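Grad-CAM itself is a general interpretability technique: gradients of a class score with respect to a convolutional feature map are average-pooled into channel weights, and the weighted feature map is passed through ReLU to give a heatmap. A compact PyTorch sketch follows, assuming a torchvision DenseNet-121 as a stand-in for the paper's backbones.

```python
import torch
import torch.nn.functional as F
from torchvision.models import densenet121

model = densenet121(weights=None).eval()

x = torch.randn(1, 3, 224, 224)              # stand-in for a video frame
fmap = model.features(x)                     # (1, C, H, W) conv feature map
pooled = F.adaptive_avg_pool2d(F.relu(fmap), (1, 1)).flatten(1)
logits = model.classifier(pooled)            # reuse the model's own head

score = logits[0, logits.argmax()]           # top-class score
grads = torch.autograd.grad(score, fmap)[0]  # d(score)/d(feature map)

w = grads.mean(dim=(2, 3), keepdim=True)     # pooled grads -> channel weights
cam = F.relu((w * fmap).sum(dim=1, keepdim=True))          # (1, 1, H, W)
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear",
                    align_corners=False)     # upsample to the frame size
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # 0-1 heatmap
```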
DOI: 10.1007/s00521-025-10978-0