Spatio-temporal masked autoencoder-based phonetic segments classification from ultrasound

Bibliographic Details
Published in:	Speech communication Vol. 169; p. 103186
Main Authors:	Dan, Xi; Xu, Kele; Zhou, Yihang; Yang, Chuanguang; Chen, Yihao; Dou, Yutao; Yang, Cheng
Format:	Journal Article
Language:	English
Published:	Elsevier B.V. 01.04.2025
Subjects:	Self-supervised learning; Spatio-temporal masking strategy; Ultrasound tongue imaging
ISSN:	0167-6393

Description
Summary:	The integration of Ultrasound Tongue Imaging (UTI) into clinical linguistics and phonetics research facilitates the examination of articulatory patterns and of the correlation between speech sounds and their physical manifestations. This is highly advantageous for diagnosing speech disorders and for advancing the study of speech production and silent speech recognition. In recent years, self-supervised learning (SSL) has gained attention as a cost-effective approach for analyzing UTI data. However, most existing SSL models do not fully exploit the contextual information embedded within UTI sequences. To tackle this challenge, we present a novel SSL framework for UTI classification that capitalizes on both the pre-training and fine-tuning phases. Specifically, we propose spatio-temporal masking to harness contextual information during pre-training, thus reducing the need for human annotation. In addition, we insert a token shift module into the encoder to enhance the model's representation of the spatio-temporal features of tongue movements in UTI sequences. Furthermore, to imitate the decision path of domain experts, we apply hard example mining during fine-tuning to improve the performance of the model. Experimental results on a publicly available dataset demonstrate that the proposed method outperforms other competitive methods in UTI classification tasks, which underscores its potential to enhance the analysis and interpretation of UTI data. Our code is available at https://github.com/colaudiolab/USenhance.git.

• We propose a spatio-temporal masking strategy for SSL pre-training, tailored to leverage contextual information within UTI sequences. The approach selectively masks visual pixel blocks positioned identically across multiple frames.
• We introduce a Token Shift Module to further enrich the representation of spatio-temporal features specific to tongue movements within ultrasound image sequences.
• Extensive experiments conducted on UXTD demonstrate the effectiveness of the proposed method.
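
The masking idea described above (hiding pixel blocks at the same spatial positions in every frame of a sequence) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `tube_mask` and `apply_mask`, the 75% mask ratio, the 16-pixel patch size, and the choice to zero out masked patches are all assumptions made for illustration only.

```python
import numpy as np


def tube_mask(num_frames, grid_h, grid_w, mask_ratio=0.75, seed=None):
    """Build a boolean patch mask of shape (num_frames, grid_h, grid_w).

    True marks a masked patch. The spatial pattern is sampled once and
    repeated for every frame, so the same positions are hidden throughout
    the sequence. The default mask_ratio is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))

    # Sample the masked spatial positions once, without replacement.
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    spatial = np.zeros(num_patches, dtype=bool)
    spatial[masked_idx] = True
    spatial = spatial.reshape(grid_h, grid_w)

    # Reuse the identical spatial pattern for all frames.
    return np.broadcast_to(spatial, (num_frames, grid_h, grid_w)).copy()


def apply_mask(frames, mask, patch):
    """Zero out masked patches in frames of shape (T, H, W).

    Zero-filling is an assumption; a masked autoencoder would typically
    drop the masked tokens before the encoder instead.
    """
    # Expand the patch-level mask to pixel resolution.
    pixel_mask = np.repeat(np.repeat(mask, patch, axis=1), patch, axis=2)
    return np.where(pixel_mask, 0.0, frames)


# Hypothetical input: 8 frames of 64x64 pixels, split into a 4x4 patch grid.
frames = np.random.default_rng(0).random((8, 64, 64))
mask = tube_mask(num_frames=8, grid_h=4, grid_w=4, mask_ratio=0.75, seed=0)
masked_frames = apply_mask(frames, mask, patch=16)
```

Because the hidden positions are identical in every frame, a model cannot recover a masked block by copying the same location from a neighbouring frame, which is the usual motivation for this kind of masking.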
DOI:	10.1016/j.specom.2025.103186