PIMCOMP: A Universal Compilation Framework for Crossbar-based PIM DNN Accelerators

Bibliographic Details
Published in:	2023 60th ACM/IEEE Design Automation Conference (DAC), pp. 1-6
Main Authors:	Sun, Xiaotian; Wang, Xinyu; Li, Wanqian; Wang, Lei; Han, Yinhe; Chen, Xiaoming
Format:	Conference Proceeding
Language:	English
Published:	IEEE, 9 July 2023
Subjects:	Accelerator architectures; compilation framework; deep neural network; Design automation; Low latency communication; NVM; Parallel processing; PIM accelerator; Pipelines; Power demand; Throughput
Online Access:	Full text

Description
Abstract:	Crossbar-based PIM DNN accelerators can provide massively parallel in-situ operations. A specifically designed compiler is important to achieve high performance for a wide variety of DNN workloads. However, some key compilation issues such as parallelism considerations, weight replication selection, and array mapping methods have not been solved. In this work, we propose PIMCOMP, a universal compilation framework for NVM crossbar-based PIM DNN accelerators. PIMCOMP is built on an abstract PIM accelerator architecture, which is compatible with the widely used Crossbar/IMA/Tile/Chip hierarchy. On this basis, we propose four general compilation stages for crossbar-based PIM accelerators: node partitioning, weight replicating, core mapping, and dataflow scheduling. We design two compilation modes with different inter-layer pipeline granularities to support high-throughput and low-latency application scenarios, respectively. Our experimental results show that PIMCOMP yields improvements of 1.6× and 2.4× in throughput and latency, respectively, relative to PUMA.
DOI:	10.1109/DAC56929.2023.10247928