Feature learning of Japanese pitch accents and applications to Japanese speech education

Bibliographic details
Published in:	2023 14th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI), pp. 188-193
Main author:	Masuda-Katsuse, Ikuyo
Format:	Conference paper
Language:	English, Japanese
Publication details:	IEEE, 08.07.2023
Subjects:	Auditory system; deep learning; Education; Feeds; Focusing; Japanese speech education; latent variables; Linguistics; pitch accent; Representation learning; speech perception; Time-frequency analysis; VAE

Description
Summary:	We modeled pitch frequency with a sequential variational autoencoder to obtain feature representations of the pitch accents of Japanese words, for applications to Japanese speech education for the hearing impaired and for Japanese-language learners. In our model, the latent variables are of two types: one represents time-invariant features of pitch accent types, and the other represents time-variant features of voiced/unvoiced segments. We approximated the distribution of the time-invariant latent variables with a Gaussian mixture model and estimated the accent type of the test data to confirm that these variables represented the features of the accent types. Next, by varying only the values of the time-invariant latent variables, we resynthesized 49 different pitch patterns per word and generated speech in which the pitch frequency of the original speech was transformed into these pitch patterns. Seven subjects rated the adequacy of the pitch patterns for the words. We found that, compared with the distribution of accent features represented in the time-invariant latent space, the distribution of the subjects' average ratings tended to extend to accent types other than the annotated ones.
DOI:	10.1109/IIAI-AAI59060.2023.00047