A Secure and Robust Multimodal Framework for In-Vehicle Voice Control: Integrating Bilingual Wake-Up, Speaker Verification, and Fuzzy Command Understanding

Bibliographic Details
Published in: Eng (Basel, Switzerland), Vol. 6, no. 11, p. 319
Main Authors: Zhang, Zhixiong; Li, Yao; Ren, Wen; Wang, Xiaoyan
Format: Journal Article
Language: English
Published: Basel, MDPI AG, 01.11.2025
ISSN: 2673-4117
Description
Summary: Intelligent in-vehicle voice systems face critical challenges in robustness, security, and semantic flexibility under complex acoustic conditions. To address these issues together, this paper proposes a multimodal, secure voice-control framework. The system integrates a hybrid dual-channel wake-up mechanism, which combines a commercial English engine (Picovoice) with a custom lightweight ResNet-Lite model for Chinese, to achieve robust cross-lingual activation. For reliable identity authentication, an optimized ECAPA-TDNN model is introduced, enhanced with spectral augmentation, sliding-window feature fusion, and an adaptive threshold mechanism. In addition, a two-tier fuzzy command matching algorithm operating at the character and pinyin levels is designed to improve tolerance to speech variations and ASR errors. Experiments on a test set covering various Chinese dialects, English accents, and noise environments evaluate each component. The wake-up mechanism maintains commercial-grade reliability for English and provides a functional baseline for Chinese. The improved ECAPA-TDNN attains equal error rates of 2.37% (quiet), 5.59% (background music), and 3.12% (high-speed noise), outperforming standard baselines and remaining robust to noise in comparison with state-of-the-art models. The fuzzy matcher raises command recognition accuracy to over 95.67% in quiet environments and above 92.7% under noise, outperforming hard matching by approximately 30%. End-to-end tests confirm an overall interaction success rate of 93.7%. This work offers a practical, integrated solution for developing secure, robust, and flexible voice interfaces in intelligent vehicles.
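The summary does not give implementation details for the two-tier fuzzy matcher, so the following is only a minimal sketch of the general idea: try a character-level match first, then fall back to comparing pinyin syllables so that homophone errors from ASR can still resolve to a command. The command list, the small hand-written pinyin table, the similarity measure (Python's `difflib.SequenceMatcher`), and both thresholds are assumptions made for illustration and are not taken from the paper.

```python
from difflib import SequenceMatcher

# Hypothetical command set (assumption, not from the paper).
COMMANDS = ["打开空调", "关闭空调", "打开车窗"]

# Tiny hand-written pinyin table, tones omitted (assumption). A real system
# would need a complete pinyin dictionary.
PINYIN = {
    "打": "da", "大": "da", "开": "kai", "关": "guan", "闭": "bi",
    "空": "kong", "控": "kong", "调": "tiao", "条": "tiao",
    "车": "che", "窗": "chuang",
}


def to_pinyin(text):
    """Map each character to a pinyin syllable; unknown characters pass through."""
    return [PINYIN.get(ch, ch) for ch in text]


def similarity(a, b):
    """Similarity ratio in [0, 1] between two sequences."""
    return SequenceMatcher(None, a, b).ratio()


def match_command(text, char_threshold=0.8, pinyin_threshold=0.8):
    """Return (command, tier), where tier is 'char', 'pinyin', or 'none'."""
    # Tier 1: compare the recognized text with each command character by character.
    best = max(COMMANDS, key=lambda c: similarity(text, c))
    if similarity(text, best) >= char_threshold:
        return best, "char"

    # Tier 2: compare syllable sequences, which tolerates homophone substitutions.
    syllables = to_pinyin(text)
    best = max(COMMANDS, key=lambda c: similarity(syllables, to_pinyin(c)))
    if similarity(syllables, to_pinyin(best)) >= pinyin_threshold:
        return best, "pinyin"

    return None, "none"


if __name__ == "__main__":
    print(match_command("打开空调"))  # exact transcript
    print(match_command("大开控条"))  # homophone errors in the transcript
    print(match_command("播放音乐"))  # not a known command
```

In this sketch an exact transcript is accepted at the character tier, while a transcript whose characters are wrong but whose syllables agree is accepted only at the pinyin tier. Where the thresholds should sit, and how ties between commands should be broken, would have to be tuned on real data.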
DOI: 10.3390/eng6110319