PIEED: Position information enhanced encoder-decoder framework for scene text recognition

Bibliographic Details
Published in: Applied Intelligence (Dordrecht, Netherlands), Vol. 51, No. 10, pp. 6698-6707
Main authors: Ma, Xitao; He, Kai; Zhang, Dazhuang; Li, Dashuang
Format: Journal Article
Language: English
Published: New York: Springer US, 01.10.2021
Springer Nature B.V.
ISSN: 0924-669X, 1573-7497
Online access: Full text
Description
Abstract: Scene text recognition (STR) technology has developed rapidly with the rise of deep learning. Recently, encoder-decoder frameworks based on the attention mechanism have been widely used in STR to improve recognition. However, the Long Short-Term Memory (LSTM) network commonly used in such frameworks tends to ignore certain position or visual information. To address this problem, we propose a Position Information Enhanced Encoder-Decoder (PIEED) framework for scene text recognition, in which an additional position information enhancement (PIE) module is proposed to compensate for the shortcomings of the LSTM network. Our module retains more position information in the feature sequence, in addition to the context information extracted by the LSTM network, which helps to improve recognition accuracy on text without linguistic context. Beyond that, our fusion decoder makes full use of the outputs of the proposed module and the LSTM network, so as to independently learn and preserve useful features, which improves recognition accuracy without increasing the number of parameters. The overall framework can be trained end-to-end using only images and their ground truth. Extensive experiments on several benchmark datasets demonstrate that the proposed framework surpasses state-of-the-art ones on both regular and irregular text recognition.
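
This record contains no implementation details beyond the abstract. As a rough illustration of the idea described there — a position-information path supplementing the LSTM context path, with the two fused before decoding — the following is a minimal PyTorch sketch. The class name PIEEDSketch, all dimensions, the gated fusion, and the per-step classifier (standing in for the paper's attention decoder) are illustrative assumptions, not the authors' actual design.

    import torch
    import torch.nn as nn

    class PIEEDSketch(nn.Module):
        """Minimal sketch of the abstract's idea, not the authors' code:
        an LSTM context path plus a position-information path over the same
        visual feature sequence, fused before classification. All layer
        names, sizes, and the gating scheme are assumptions."""

        def __init__(self, feat_dim=512, hidden_dim=256, max_len=64,
                     num_classes=97):
            super().__init__()
            # Context path: bidirectional LSTM over the encoder's feature sequence.
            self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                                bidirectional=True)
            # Hypothetical PIE module: learned positional embeddings injected
            # into the visual features so position cues survive alongside context.
            self.pos_emb = nn.Embedding(max_len, feat_dim)
            self.pos_proj = nn.Linear(feat_dim, 2 * hidden_dim)
            # Assumed fusion: a sigmoid gate weighing context vs. position
            # features at each time step.
            self.gate = nn.Linear(4 * hidden_dim, 2 * hidden_dim)
            # Per-step classifier stands in for the paper's attention decoder.
            self.classifier = nn.Linear(2 * hidden_dim, num_classes)

        def forward(self, feats):  # feats: (B, T, feat_dim) from a CNN encoder
            B, T, _ = feats.shape
            ctx, _ = self.lstm(feats)                           # (B, T, 2H)
            pos_ids = torch.arange(T, device=feats.device)
            pos = self.pos_proj(feats + self.pos_emb(pos_ids))  # (B, T, 2H)
            g = torch.sigmoid(self.gate(torch.cat([ctx, pos], dim=-1)))
            fused = g * ctx + (1.0 - g) * pos                   # gated fusion
            return self.classifier(fused)                       # (B, T, classes)

    if __name__ == "__main__":
        model = PIEEDSketch()
        print(model(torch.randn(2, 32, 512)).shape)  # torch.Size([2, 32, 97])

The assumed gate lets the model lean on position and visual features when linguistic context is weak, which is the "text without context" case the abstract highlights.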
DOI: 10.1007/s10489-021-02219-3