A Comparison of Transformer and LSTM Encoder Decoder Models for ASR
| Published in: | 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 8-15 |
|---|---|
| Main Authors: | Albert Zeyer, Parnia Bahar, Kazuki Irie, Ralf Schlüter, Hermann Ney |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | IEEE, 01.12.2019 |
| Abstract | We present competitive results using a Transformer encoder-decoder-attention model for end-to-end speech recognition, needing less training time than a similarly performing LSTM model. We observe that Transformer training is in general more stable than LSTM training, although the Transformer also seems to overfit more and thus shows more problems with generalization. We also find that two initial LSTM layers in the Transformer encoder provide a much better positional encoding. Data augmentation, a variant of SpecAugment, helps to improve both the Transformer, by 33% relative, and the LSTM, by 15% relative. We analyze several pretraining and scheduling schemes, which are crucial for both the Transformer and the LSTM models. We improve our LSTM model with additional convolutional layers. We perform our experiments on LibriSpeech 1000h, Switchboard 300h and TED-LIUM-v2 200h, and we show state-of-the-art performance on TED-LIUM-v2 for attention-based end-to-end models. We deliberately limit the training on LibriSpeech to 12.5 epochs of the training data for comparisons, to keep the results of practical interest, although we show that longer training still yields further improvements. We publish all the code and setups to run our experiments. |
|---|---|
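The abstract credits a SpecAugment-style data augmentation for the 33% (Transformer) and 15% (LSTM) relative improvements. The core of SpecAugment is masking random time and frequency blocks of the log-mel spectrogram during training. A minimal sketch of that idea, assuming zero-fill masking; this is illustrative, not the authors' exact variant, and all parameter names are made up:

```python
import numpy as np

def spec_augment(spec, num_time_masks=2, num_freq_masks=2,
                 max_time_width=20, max_freq_width=10, rng=None):
    """Mask random time and frequency blocks of a (frames x mel_bins)
    log-mel spectrogram, SpecAugment-style. Returns a masked copy."""
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    num_frames, num_bins = spec.shape
    for _ in range(num_time_masks):   # zero out a random span of frames
        w = int(rng.integers(0, max_time_width + 1))
        t0 = int(rng.integers(0, max(1, num_frames - w + 1)))
        spec[t0:t0 + w, :] = 0.0
    for _ in range(num_freq_masks):   # zero out a random span of mel bins
        w = int(rng.integers(0, max_freq_width + 1))
        f0 = int(rng.integers(0, max(1, num_bins - w + 1)))
        spec[:, f0:f0 + w] = 0.0
    return spec

x = np.ones((100, 80))                # 100 frames, 80 mel bins
y = spec_augment(x, rng=np.random.default_rng(0))
```

Because masking is applied on the fly per training example, the model never sees the same corrupted utterance twice, which is what makes it act as a regularizer against the overfitting the abstract mentions.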
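The finding that two initial LSTM layers provide a better positional encoding reflects a basic property: plain dot-product self-attention without any positional signal is permutation-equivariant, so reordering the input frames merely reorders the outputs, and the model cannot see frame order. A toy numpy check of that property (identity projections, single head; illustrative only, not the paper's model):

```python
import numpy as np

def self_attention(x):
    """Single-head dot-product self-attention with no positional
    encoding and identity Q/K/V projections, for illustration."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ x

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))        # 5 frames, 4 features
perm = np.array([3, 0, 4, 1, 2])       # reorder the frames
out = self_attention(x)
out_perm = self_attention(x[perm])
# Permuting the input only permutes the output rows: out_perm == out[perm].
```

An LSTM, by contrast, processes frames strictly left to right, so its hidden states inherently encode position; placing a couple of LSTM layers before the self-attention stack injects that order information.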
| Author | Zeyer, Albert; Bahar, Parnia; Irie, Kazuki; Schlüter, Ralf; Ney, Hermann |
| Affiliations | Albert Zeyer, Parnia Bahar, Hermann Ney: AppTek GmbH, Aachen, Germany, 52062. Kazuki Irie, Ralf Schlüter: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen, Germany, 52074 |
| DOI | 10.1109/ASRU46091.2019.9004025 |
| EISBN | 9781728103068, 1728103061 |
| EndPage | 15 |
| ExternalDocumentID | 9004025 |
| Genre | orig-research |
| ISICitedReferencesCount | 157 |
| PageCount | 8 |
| StartPage | 8 |
| SubjectTerms | attention; Convergence; Convolutional codes; Data models; Decoding; Encoding; end-to-end ASR; Hidden Markov models; LSTM; Training; Transformer |
| URI | https://ieeexplore.ieee.org/document/9004025 |