Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition

Bibliographic Details
Published in:arXiv.org
Main Authors: Iakovenko, Olga; Bondarenko, Ivan
Format: Paper
Language:English
Published: Ithaca Cornell University Library, arXiv.org 04.10.2024
ISSN:2331-8422
Abstract: For many Automatic Speech Recognition (ASR) tasks, spectrogram audio features yield better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use because of the high dimensionality of the feature space. This paper presents an alternative approach to generating a compressed spectrogram representation, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. The trained model for a 40-dimensional (300 ms) embedding was used to generate features for a corpus of spoken commands from the GoogleSpeechCommands dataset. Using the generated features, an ASR system was built and compared to a model that uses MFCC features.
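The latent-variable step that any VAE of this kind relies on can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not describe the convolutional encoder and decoder, so they are omitted here, and the squared-error reconstruction term, the unweighted sum of the two loss terms, and all function names are assumptions made for illustration. Only the 13-dimensional embedding size comes from the abstract.

```python
import math
import random

LATENT_DIM = 13  # embedding size quoted in the abstract


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def vae_loss(x, x_hat, mu, log_var):
    """Squared reconstruction error plus KL term (assumed, unweighted form)."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + kl_to_standard_normal(mu, log_var)


# In a real model, mu and log_var would be produced by the convolutional
# encoder from a spectrogram fragment; here they are fixed placeholders.
rng = random.Random(0)
mu = [0.0] * LATENT_DIM
log_var = [0.0] * LATENT_DIM
z = sample_latent(mu, log_var, rng)
```

In this sketch, `z` plays the role of the compressed feature vector that a downstream ASR model would consume in place of MFCC features; the KL term is what keeps such embeddings close to a standard normal prior during training.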
Copyright 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
DOI 10.48550/arxiv.2410.02560
Discipline Physics
EISSN 2331-8422
Genre Working Paper/Pre-Print
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
Subject Terms: Automatic speech recognition; Datasets; Embedding; Spectrograms; Task complexity
URI https://www.proquest.com/docview/3112967691