Spatio-temporal masked autoencoder-based phonetic segments classification from ultrasound
| Published in: | Speech communication Vol. 169; p. 103186 |
|---|---|
| Main Authors: | , , , , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier B.V, 01.04.2025 |
| Subjects: | |
| ISSN: | 0167-6393 |
| Summary: | The integration of Ultrasound Tongue Imaging (UTI) into clinical linguistics and phonetics research facilitates the examination of articulatory patterns and of the correlation between speech sounds and their physical manifestations. This is highly advantageous for diagnosing speech disorders and for advancing the study of speech production and silent speech recognition. In recent years, self-supervised learning (SSL) has gained attention as a cost-effective approach to analyzing UTI data. However, most existing SSL models do not fully exploit the contextual information embedded within UTI sequences. To tackle this challenge, we present a novel SSL framework for UTI classification that capitalizes on both the pre-training and fine-tuning phases. Specifically, we propose spatio-temporal masking to harness contextual information during pre-training, thus reducing the need for human annotation. In addition, we insert a token shift module into the encoder to enhance the model's representation of the spatio-temporal features of tongue movements in UTI sequences. Finally, to imitate the decision path of domain experts, we apply hard example mining techniques during fine-tuning to improve the performance of the model. The experimental results on a publicly available dataset demonstrate that our proposed method outperforms other competitive methods in UTI classification tasks, which underscores the potential of our approach to enhance the analysis and interpretation of UTI data. Our code is available at https://github.com/colaudiolab/USenhance.git. • We propose a spatio-temporal masking strategy for SSL pre-training, tailored to leverage contextual information within UTI sequences. The approach selectively masks visual pixel blocks positioned identically across multiple frames. • We introduce a Token Shift Module to further enrich the representation of spatio-temporal features specific to tongue movements within ultrasound image sequences. • Extensive experiments conducted on UXTD demonstrate the effectiveness of our proposed method. |
|---|---|
| DOI: | 10.1016/j.specom.2025.103186 |
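
The highlights describe masking pixel blocks positioned identically across multiple frames. The following is a minimal sketch of one way such a mask could be built, not the authors' implementation: the function name `tube_mask`, the patch-grid size, the frame count, and the mask ratio are all illustrative assumptions that do not come from the record.

```python
import random


def tube_mask(num_frames, grid_h, grid_w, mask_ratio, seed=None):
    """Return a [frame][row][col] grid of booleans; True marks a hidden patch.

    Hypothetical helper: the same spatial patches are hidden in every frame,
    which is one reading of "blocks positioned identically across frames".
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = round(mask_ratio * num_patches)
    # Sample the hidden spatial positions once, then reuse them for all frames.
    hidden = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [
        [(r * grid_w + c) in hidden for c in range(grid_w)]
        for r in range(grid_h)
    ]
    # Copy the per-frame mask so each frame owns an independent grid.
    return [[row[:] for row in frame_mask] for _ in range(num_frames)]


# Assumed sizes: 8 frames, a 14 x 14 patch grid, 75% of patches hidden.
mask = tube_mask(num_frames=8, grid_h=14, grid_w=14, mask_ratio=0.75, seed=0)
```

Because the hidden positions are shared across frames, a model cannot recover a masked patch by copying the same location from a neighbouring frame; it has to rely on the surrounding spatial and temporal context instead.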