Parallelizing and optimizing neural Encoder–Decoder models without padding on multi-core architecture
| Published in: | Future Generation Computer Systems, Volume 108, pp. 1206–1213 |
|---|---|
| Main authors: | , , , , , , |
| Medium: | Journal Article |
| Language: | English; Japanese |
| Published: | Elsevier B.V., 01.07.2020 |
| ISSN: | 0167-739X |
| DOI: | 10.1016/j.future.2018.04.070 |
Summary:
Scaling up Artificial Intelligence (AI) algorithms for massive datasets to improve their performance is becoming crucial. In Machine Translation (MT), one of the most important research fields of AI, models based on Recurrent Neural Networks (RNN) have shown state-of-the-art performance in recent years, and many researchers keep working on improving RNN-based models to achieve better accuracy in translation tasks. Most implementations of Neural Machine Translation (NMT) models employ a padding strategy when processing a mini-batch, so that all sentences in the mini-batch have the same length. This enables efficient utilization of caches and GPU/SIMD parallelism but wastes computation time. In this paper, we implement and parallelize batch learning for a Sequence-to-Sequence (Seq2Seq) model, the most basic model of NMT, without using a padding strategy. More specifically, our approach forms the vectors that represent the input words, as well as the neural network's states at different time steps, into matrices as it processes one sentence; as a result, it makes better use of the cache and optimizes the process that adjusts weights and biases during the back-propagation phase. Our experimental evaluation shows that our implementation achieves better scalability on multi-core CPUs. We also discuss our approach's potential for use in other implementations of RNN-based models.
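The padding trade-off the summary describes can be seen in a few lines of NumPy. The sketch below is illustrative only (all shapes and names are assumptions, not the authors' code): padding lets a mini-batch form one dense tensor, but the pad rows still take part in every matrix product, whereas per-sentence processing spends arithmetic only on real tokens.

```python
# Minimal sketch (not the paper's code) contrasting the usual padding
# strategy with per-sentence processing; shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d = 4                                                  # embedding size (assumed)
sentences = [rng.standard_normal((t, d)) for t in (3, 7, 2)]  # variable lengths

# Padding strategy: pad every sentence to the longest length so the
# mini-batch becomes one dense tensor; pad rows are still multiplied.
T = max(s.shape[0] for s in sentences)
batch = np.stack([np.vstack([s, np.zeros((T - s.shape[0], d))])
                  for s in sentences])                 # shape (batch, T, d)
W = rng.standard_normal((d, d))
projected = batch @ W                                  # computes T steps per
                                                       # sentence, pads included

# Without padding: each sentence is processed at exactly its own length,
# so no work is spent on pad tokens.
projected_list = [s @ W for s in sentences]
```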
Highlights:
- We explored the possibility of performing parallel training of Encoder–Decoder models efficiently without a padding strategy.
- Our approach dynamically allocates threads, each handling the training of one pair of sentences, to avoid wasting computation (see the second sketch below).
- We optimized our approach by forming the vectors generated at different time steps into matrices to obtain better cache usage (see the first sketch below).
- The optimized approach makes a Seq2Seq model scale well and perform considerably better.
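The "forming vectors into matrices" highlight corresponds to a standard identity: the weight gradient accumulated over one sentence, a sum of per-time-step outer products, equals a single matrix product of the stacked hidden states and stacked error signals. A minimal sketch, with hypothetical names `H`, `delta`, and `dW`:

```python
# Hedged sketch of the matrix-forming idea: replace T rank-1 gradient
# updates with one matmul over the whole sentence. Names are illustrative.
import numpy as np

rng = np.random.default_rng(1)
T, d = 7, 4                              # time steps and state size (assumed)
H = rng.standard_normal((T, d))          # hidden states h_1..h_T, stacked row-wise
delta = rng.standard_normal((T, d))      # back-propagated errors per time step

# Step-by-step accumulation: T small rank-1 updates, poor cache reuse.
dW_loop = np.zeros((d, d))
for t in range(T):
    dW_loop += np.outer(H[t], delta[t])

# Matrix form: a single (d x T)(T x d) product over the whole sentence.
dW_matrix = H.T @ delta

assert np.allclose(dW_loop, dW_matrix)
```

One large matrix product touches each operand in a contiguous, BLAS-friendly access pattern instead of T scattered rank-1 updates, which is plausibly where the cache benefit the paper reports comes from.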
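The dynamic thread allocation in the second highlight can be pictured as a work queue of sentence pairs: each worker pulls the next pair as soon as it finishes, so no thread sits idle waiting for the longest sentence in a fixed-shape batch. A schematic sketch under assumed names (`train_step` and the shapes are hypothetical, and the paper's actual implementation targets multi-core CPUs directly):

```python
# Schematic sketch (not the authors' implementation) of dispatching whole
# sentence pairs to worker threads instead of padding them into one batch.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

rng = np.random.default_rng(2)
d = 4
pairs = [(rng.standard_normal((t, d)), rng.standard_normal((t, d)))
         for t in (3, 7, 2, 5)]          # (source, target) of varying lengths
W = rng.standard_normal((d, d))

def train_step(pair):
    src, tgt = pair
    # Placeholder for one sentence pair's forward/backward pass;
    # just the input projections here to keep the sketch runnable.
    return src @ W, tgt @ W

with ThreadPoolExecutor(max_workers=4) as pool:
    # The executor's internal queue hands pairs to whichever worker
    # frees up first, i.e., dynamic per-sentence scheduling.
    results = list(pool.map(train_step, pairs))
```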