Intelligent speech recognition algorithm in multimedia visual interaction via BiLSTM and attention mechanism

Bibliographic Details
Published in:	Neural computing & applications Vol. 36; no. 5; pp. 2371 - 2383
Main Author:	Feng, Yican
Format:	Journal Article
Language:	English
Published:	London: Springer London, 01.02.2024; Springer Nature B.V
Subjects:	Algorithms; Artificial Intelligence; Attention; BiLSTM; Computational Biology/Bioinformatics; Computational Science and Engineering; Computer Science; Data Mining and Knowledge Discovery; Image Processing and Computer Vision; Multimedia; Multimedia visual interaction; Neural networks; Performance enhancement; Probability and Statistics in Computer Science; S.I.: Machine Learning and Big Data Analytics for IoT Security and Privacy (SPIoT 2022); Speech; Speech recognition; Voice recognition
ISSN:	0941-0643, 1433-3058

Description
Summary:	With the rapid development of information technology, multimedia integration platforms are used more and more widely, and speech recognition has become an important topic in multimedia visual interaction. The accuracy of speech recognition depends on several factors, two of which are the acoustic features of speech and the recognition model. Speech data is complex and variable, yet most methods extract only a single type of feature to represent the speech signal, and a single feature cannot express the information hidden in the signal. A strong recognition model, in turn, can learn the characteristic speech information better and so improve performance. This work proposes a new method for speech recognition in multimedia visual interaction. First, to address the problem that a single feature cannot fully represent complex speech information, it proposes three feature fusion structures that extract speech information from different angles, yielding three different fusion features based on low-level features and a higher-level sparse representation. Second, it draws on the strong learning ability of neural networks and the weight distribution mechanism of the attention model: the fusion features are fed to a bidirectional long short-term memory (BiLSTM) network with attention. The extracted fusion features contain more speech information and are strongly discriminative, and increasing the attention weight of a feature increases its influence on the predicted value, which improves performance. Finally, systematic experiments on the proposed method verify its feasibility.
DOI:	10.1007/s00521-023-08959-2
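
The summary describes two ideas: fusing several feature types per frame, and letting attention weights decide how much each time step contributes to the prediction. The sketch below illustrates both on toy data. It is not the paper's implementation: the record does not specify the three fusion structures or the attention form, so frame-wise concatenation, dot-product scoring, the `query` vector, and all function names are assumptions made for illustration. The fused frames also stand in for BiLSTM outputs, which a real model would compute first.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def fuse_features(low_level: Sequence[Vector], sparse: Sequence[Vector]) -> List[Vector]:
    """Concatenate two feature streams frame by frame (an assumed fusion rule)."""
    if len(low_level) != len(sparse):
        raise ValueError("both streams must have the same number of frames")
    return [list(a) + list(b) for a, b in zip(low_level, sparse)]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(states: Sequence[Vector], query: Vector) -> Tuple[Vector, Vector]:
    """Score each time step by its dot product with `query`,
    normalise the scores with softmax, and return the weighted sum
    of the states together with the weights."""
    scores = [sum(h * q for h, q in zip(state, query)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(w * state[i] for w, state in zip(weights, states)) for i in range(dim)]
    return context, weights


# Toy data: 3 frames, each with 2 low-level values and 1 sparse-code value.
low_level = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1]]
sparse = [[0.0], [1.0], [0.0]]
fused = fuse_features(low_level, sparse)

# Hypothetical query vector; in a trained model this would be learned.
query = [1.0, 1.0, 1.0]

# Stand-in for BiLSTM hidden states: the fused frames are used directly.
context, weights = attention_pool(fused, query)
```

Here the middle frame receives the largest weight because its dot product with `query` is highest, so it dominates `context`. This is the sense in which a larger weight increases a feature's influence on the prediction.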