Architectural Trade-Off Analysis for Accelerating LSTM Network Using Radix-r OBC Scheme

Bibliographic Details
Published in:	IEEE transactions on circuits and systems. I, Regular papers, Vol. 70, No. 1, pp. 266-279
Main Authors:	Khan, Mohd Tasleem; Yantir, Hasan Erdem; Salama, Khaled Nabil; Eltawil, Ahmed M.
Format:	Journal Article
Language:	English
Published:	New York, IEEE, 01.01.2023; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects:	Architecture; Binary codes; Circulant matrices; Clocks; Computational modeling; Computer architecture; Logic gates; long short-term memory (LSTM); Mathematical analysis; Matrix algebra; matrix-vector multiplication (MVM); offset binary coding (OBC); Random access memory; recurrent neural network (RNN); Table lookup; Throughput; Time multiplexing; Tradeoffs
ISSN:	1549-8328, 1558-0806

Description
Summary:	This paper presents an architectural trade-off analysis for accelerating two (Type I, II) fixed-point long short-term memory (LSTM) networks based on circulant matrix-vector multiplications (MVMs) using a radix-$r$ offset binary coding (OBC) scheme. The Type I MVM architecture rotates the weights with the proposed modulo-cum interleaver and uses partial product generators (PPGs) with a single generation unit across a column. It is hardware-optimized using a single adder tree through time-multiplexing. Meanwhile, the Type II MVM architecture rotates the inputs with the proposed store-cum interleaver and uses single PPGs with a single generation unit across a row. It is time-optimized by unfolding the shift-accumulate unit into a shift-add tree followed by pipelining. A new design for element-wise multiplication using a radix-$r$ PPG is also presented. Both designs are extended to their block-circulant variants for certain accuracy requirements. Post-synthesis results of the Type I and II architectures for different model, kernel, and radix sizes and clock frequencies yield several efficient designs. Compared with the prior scheme, the Type I architecture for $128 \times 128$ with $r=2$ on 28 nm FDSOI technology at 800 MHz occupies 32.27% less area and consumes 67.89% less power at the same throughput, while the Type II architecture provides $40\times$ higher throughput at the expense of area and power.
DOI:	10.1109/TCSI.2022.3217091
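
Note on the summary: the circulant MVM it mentions can be computed either by rotating the weights or by rotating the inputs. The following is a minimal software sketch of that equivalence only, not the paper's hardware architecture. It omits the OBC scheme, the interleavers, and the PPGs, and the function names and the row-rotation convention (each row is the previous row rotated right by one) are assumptions made for illustration.

```python
def rotate_right(v, k):
    """Return v rotated right by k positions: result[j] == v[(j - k) % n]."""
    k %= len(v)
    return v[-k:] + v[:-k] if k else list(v)


def rotate_left(v, k):
    """Return v rotated left by k positions: result[i] == v[(i + k) % n]."""
    k %= len(v)
    return v[k:] + v[:k]


def circulant_mvm_rotate_weights(w, x):
    """y = C x, where row i of C is the weight vector w rotated right by i.

    Only the n entries of w are stored; each row is regenerated by rotation.
    """
    n = len(w)
    return [
        sum(a * b for a, b in zip(rotate_right(w, i), x))
        for i in range(n)
    ]


def circulant_mvm_rotate_inputs(w, x):
    """Same product, computed by rotating the input vector instead.

    Uses y[i] = sum_k w[k] * x[(i + k) % n], i.e. y accumulates
    w[k] times x rotated left by k.
    """
    n = len(w)
    y = [0] * n
    for k in range(n):
        xr = rotate_left(x, k)
        for i in range(n):
            y[i] += w[k] * xr[i]
    return y


if __name__ == "__main__":
    w = [1, 2, 3, 4]    # hypothetical integer (fixed-point-like) weights
    x = [1, 0, -1, 2]   # hypothetical input vector
    print(circulant_mvm_rotate_weights(w, x))
    print(circulant_mvm_rotate_inputs(w, x))
```

Both functions return the same vector for the same `w` and `x`. The point of the circulant structure is that an n-by-n weight matrix is represented by n stored values rather than n squared.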