Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition

For many Automatic Speech Recognition (ASR) tasks audio features as spectrograms show better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use due to a complex dimensionality of a feature space. The following paper presents an alternative approach towards...

Full description

Saved in:
Bibliographic Details
Published in:arXiv.org
Main Authors: Iakovenko, Olga, Bondarenko, Ivan
Format: Paper
Language:English
Published: Ithaca Cornell University Library, arXiv.org 04.10.2024
Subjects:
ISSN:2331-8422
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:For many Automatic Speech Recognition (ASR) tasks audio features as spectrograms show better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use due to a complex dimensionality of a feature space. The following paper presents an alternative approach towards generating compressed spectrogram representation, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. The trained model for a 40-dimensional (300 ms) embedding was used to generate features for corpus of spoken commands on the GoogleSpeechCommands dataset. Using the generated features an ASR system was built and compared to the model with MFCC features.
Bibliography:SourceType-Working Papers-1
ObjectType-Working Paper/Pre-Print-1
content type line 50
ISSN:2331-8422
DOI:10.48550/arxiv.2410.02560