Feature learning of Japanese pitch accents and applications to Japanese speech education

Bibliographic Details
Published in:	2023 14th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI), pp. 188-193
Main Author:	Masuda-Katsuse, Ikuyo
Format:	Conference Proceeding
Language:	English, Japanese
Published:	IEEE 08.07.2023
Subjects:	Auditory system; deep learning; Education; Feeds; Focusing; Japanese speech education; latent variables; Linguistics; pitch accent; Representation learning; speech perception; Time-frequency analysis; VAE

Description
Summary:	We modeled pitch frequency with a sequential variational autoencoder to obtain feature representations of the pitch accents of Japanese words, with applications to Japanese speech education for the hearing impaired and for Japanese-language learners. Our model has two types of latent variables: one represents time-invariant features of pitch accent types, and the other represents time-variant features of voiced/unvoiced segments. We approximated the distribution of the time-invariant latent variables with a Gaussian mixture model and estimated the accent type of the test data to confirm that these variables represent the features of the accent types. Next, by varying only the values of the time-invariant latent variables, we resynthesized 49 different pitch patterns per word and generated speech in which the pitch frequency of the original speech was transformed into each of these patterns. Seven subjects rated the adequacy of the pitch patterns for the words. Compared with the distribution of accent features represented in the time-invariant latent space, the distribution of the subjects' average ratings tended to extend to accent types other than the annotated ones.
DOI:	10.1109/IIAI-AAI59060.2023.00047
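
The accent-type estimation step in the Summary can be illustrated with a minimal sketch. It is not the author's implementation: it assumes the time-invariant latent vectors are already available as plain tuples, it fits one diagonal Gaussian per annotated accent type as a simple supervised stand-in for a Gaussian mixture model, and the names `fit_components`, `estimate_accent_type`, the labels `type0`/`type1`, and the toy 2-D latent values are all hypothetical.

```python
import math
from collections import defaultdict


def fit_components(latents, labels):
    """Fit one diagonal Gaussian per accent type (assumed stand-in for a GMM)."""
    groups = defaultdict(list)
    for z, y in zip(latents, labels):
        groups[y].append(z)
    n_total = len(latents)
    components = {}
    for y, zs in groups.items():
        dim = len(zs[0])
        mean = [sum(z[d] for z in zs) / len(zs) for d in range(dim)]
        # A small variance floor keeps the density finite for tight clusters.
        var = [
            max(sum((z[d] - mean[d]) ** 2 for z in zs) / len(zs), 1e-6)
            for d in range(dim)
        ]
        weight = len(zs) / n_total  # mixture weight of this component
        components[y] = (weight, mean, var)
    return components


def log_density(z, mean, var):
    """Log-density of a diagonal Gaussian evaluated at z."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(z, mean, var)
    )


def estimate_accent_type(z, components):
    """Return the accent type whose weighted component density at z is highest."""
    return max(
        components,
        key=lambda y: math.log(components[y][0])
        + log_density(z, components[y][1], components[y][2]),
    )


# Hypothetical 2-D time-invariant latent vectors with annotated accent types.
train_z = [(-2.0, 0.1), (-1.8, -0.2), (-2.2, 0.0),
           (2.0, 0.1), (1.9, -0.1), (2.1, 0.2)]
train_y = ["type0", "type0", "type0", "type1", "type1", "type1"]

components = fit_components(train_z, train_y)
print(estimate_accent_type((-1.5, 0.0), components))
print(estimate_accent_type((1.7, 0.0), components))
```

In this sketch a test latent vector is assigned to the accent type with the highest posterior under the fitted components. A mixture fitted without labels, for example by expectation-maximization, would need an additional step mapping components to accent types.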