TGPA: Tile-Grained Pipeline Architecture for Low Latency CNN Inference

FPGAs are more and more widely used as reconfigurable hardware accelerators for applications leveraging convolutional neural networks (CNNs) in recent years. Previous designs normally adopt a uniform accelerator architecture that processes all layers of a given CNN model one after another. This homo...

Celý popis

Uložené v:
Podrobná bibliografia
Vydané v:2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) s. 1 - 8
Hlavní autori: Wei, Xuechao, Liang, Yun, Li, Xiuhong, Yu, Cody Hao, Zhang, Peng, Cong, Jason
Médium: Konferenčný príspevok..
Jazyk:English
Vydavateľské údaje: ACM 01.11.2018
Predmet:
ISSN:1558-2434
On-line prístup:Získať plný text
Tagy: Pridať tag
Žiadne tagy, Buďte prvý, kto otaguje tento záznam!
Popis
Shrnutí:FPGAs are more and more widely used as reconfigurable hardware accelerators for applications leveraging convolutional neural networks (CNNs) in recent years. Previous designs normally adopt a uniform accelerator architecture that processes all layers of a given CNN model one after another. This homogeneous design methodology usually has dynamic resource underutilization issue due to the tensor shape diversity of different layers. As a result, designs equipped with heterogeneous accelerators specific for different layers were proposed to resolve this issue. However, existing heterogeneous designs sacrifice latency for throughput by concurrent execution of multiple input images on different accelerators. In this paper, we propose an architecture named Tile-Grained Pipeline Architecture (TGPA) for low latency CNN inference. TGPA adopts a heterogeneous design which supports pipelining execution of multiple tiles within a single input image on multiple heterogeneous accelerators. The accelerators are partitioned onto different FPGA dies to guarantee high frequency. A partition strategy is designd to maximize on-chip resource utilization. Experiment results show that TGPA designs for different CNN models achieve up to 40% performance improvement than homogeneous designs, and 3X latency reduction over state-of-the-art designs.
ISSN:1558-2434
DOI:10.1145/3240765.3240856