A Comparison of Transformer and LSTM Encoder Decoder Models for ASR

Bibliographic Details
Published in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 8 - 15
Main Authors: Zeyer, Albert; Bahar, Parnia; Irie, Kazuki; Schlüter, Ralf; Ney, Hermann
Format: Conference Proceeding
Language: English
Published: IEEE, 01.12.2019
Subjects: attention; Convergence; Convolutional codes; Data models; Decoding; Encoding; end-to-end ASR; Hidden Markov models; LSTM; Training; Transformer
Abstract We present competitive results using a Transformer encoder-decoder-attention model for end-to-end speech recognition, needing less training time compared to a similarly performing LSTM model. We observe that Transformer training is in general more stable than LSTM training, although it also seems to overfit more and thus shows more problems with generalization. We also find that two initial LSTM layers in the Transformer encoder provide a much better positional encoding. Data augmentation, a variant of SpecAugment, helps to improve both the Transformer by 33% and the LSTM by 15% relative. We analyze several pretraining and scheduling schemes, which are crucial for both the Transformer and the LSTM models. We improve our LSTM model with additional convolutional layers. We perform our experiments on LibriSpeech 1000h, Switchboard 300h and TED-LIUM-v2 200h, and we show state-of-the-art performance on TED-LIUM-v2 for attention-based end-to-end models. We deliberately limit the training on LibriSpeech to 12.5 epochs of the training data for comparisons, to keep the results of practical interest, although we show that longer training still yields further improvements. We publish all the code and setups to run our experiments.
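The abstract's observation that two initial LSTM layers in the Transformer encoder provide a better positional encoding can be made concrete with a small sketch. The paper's setups are published as RETURNN configs; the PyTorch module below, including its class name, dimensions and layer counts, is a hypothetical illustration of the idea, not the authors' implementation:

```python
import torch
import torch.nn as nn

class LSTMFrontedTransformerEncoder(nn.Module):
    """Illustrative only: two bidirectional LSTM layers in front of a
    Transformer stack give the model a learned, recurrence-based notion
    of position, replacing explicit positional encodings. All
    hyperparameters here are placeholders, not the paper's configuration."""

    def __init__(self, feat_dim=80, d_model=512, num_layers=12, nhead=8):
        super().__init__()
        # Two LSTM layers; the two bidirectional halves together emit d_model features.
        self.lstm = nn.LSTM(feat_dim, d_model // 2, num_layers=2,
                            bidirectional=True, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x):               # x: (batch, time, feat_dim)
        x, _ = self.lstm(x)             # (batch, time, d_model)
        return self.transformer(x)      # (batch, time, d_model)

# e.g. LSTMFrontedTransformerEncoder()(torch.randn(4, 100, 80)).shape
# -> torch.Size([4, 100, 512])
```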
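Likewise, the SpecAugment variant mentioned in the abstract amounts to zeroing out random time and frequency bands of the input feature matrix. A minimal sketch, assuming log-mel features of shape (time, freq); the mask counts and maximum widths are placeholder values, not the paper's variant:

```python
import torch

def spec_augment(features: torch.Tensor, num_time_masks: int = 2,
                 max_time: int = 20, num_freq_masks: int = 2,
                 max_freq: int = 10) -> torch.Tensor:
    """Illustrative SpecAugment-style masking on a (time, freq) tensor."""
    t, f = features.shape
    out = features.clone()
    for _ in range(num_time_masks):     # mask random spans of time steps
        width = int(torch.randint(0, max_time + 1, (1,)))
        start = int(torch.randint(0, max(1, t - width), (1,)))
        out[start:start + width, :] = 0.0
    for _ in range(num_freq_masks):     # mask random frequency bands
        width = int(torch.randint(0, max_freq + 1, (1,)))
        start = int(torch.randint(0, max(1, f - width), (1,)))
        out[:, start:start + width] = 0.0
    return out
```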
Author Bahar, Parnia
Schlüter, Ralf
Irie, Kazuki
Ney, Hermann
Zeyer, Albert
Author_xml – sequence: 1
  givenname: Albert
  surname: Zeyer
  fullname: Zeyer, Albert
  organization: AppTek GmbH, Aachen, Germany, 52062
– sequence: 2
  givenname: Parnia
  surname: Bahar
  fullname: Bahar, Parnia
  organization: AppTek GmbH, Aachen, Germany, 52062
– sequence: 3
  givenname: Kazuki
  surname: Irie
  fullname: Irie, Kazuki
  organization: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen, Germany, 52074
– sequence: 4
  givenname: Ralf
  surname: Schlüter
  fullname: Schlüter, Ralf
  organization: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen, Germany, 52074
– sequence: 5
  givenname: Hermann
  surname: Ney
  fullname: Ney, Hermann
  organization: AppTek GmbH, Aachen, Germany, 52062
ContentType Conference Proceeding
DOI 10.1109/ASRU46091.2019.9004025
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE/IET Electronic Library
IEEE Proceedings Order Plans (POP All) 1998-Present
EISBN 9781728103068
1728103061
EndPage 15
ExternalDocumentID 9004025
Genre orig-research
ISICitedReferencesCount 157
IsPeerReviewed false
IsScholarly false
Language English
PageCount 8
ParticipantIDs ieee_primary_9004025
PublicationCentury 2000
PublicationDate 2019-12-01
PublicationDateYYYYMMDD 2019-12-01
PublicationDate_xml – month: 12
  year: 2019
  text: 2019-12-01
  day: 01
PublicationDecade 2010
PublicationTitle 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
PublicationTitleAbbrev ASRU
PublicationYear 2019
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 8
SubjectTerms attention
Convergence
Convolutional codes
Data models
Decoding
Encoding
end-to-end ASR
Hidden Markov models
LSTM
Training
Transformer
Title A Comparison of Transformer and LSTM Encoder Decoder Models for ASR
URI https://ieeexplore.ieee.org/document/9004025
WOSCitedRecordID wos000539883100002