MambaOPU: An FPGA Overlay Processor for State-space-duality-based Mamba Models
| Published in: | 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7 |
|---|---|
| Main authors: | , , , , , , , , |
| Format: | Conference paper |
| Language: | English |
| Publication details: | IEEE, 22 June 2025 |
| Summary: | State-space models (SSMs), such as Mamba, have emerged as a promising alternative to Transformers. However, the recently developed Mamba2, based on state space duality (SSD), is highly memory-bound and suffers from limited computation efficiency. This inefficiency arises from its irregular broadcast element-wise multiplications and structured sparse computations. In this work, we propose MambaOPU, an FPGA overlay processor, to accelerate SSD. First, to reduce memory overhead, we introduce a software-hardware co-optimized operator fusion framework. Specifically, operator merging combines adjacent broadcast multiplication and summation operations into a single descriptor, while operator backward shifting embeds segment multiplication into subsequent operations. Both techniques shorten the computation path and improve computation efficiency. Second, to enhance sparse computation efficiency, we skip zero-region computations using a tensor-reorder-and-group algorithm combined with a sparse-predefined data fetcher. Additionally, since Mamba integrates linear operations with SSD, we develop a reconfigurable systolic array to improve data reuse across different computation modes. Extensive experimental results demonstrate that MambaOPU achieves up to 1812× and 880.79× higher normalized throughput and up to 12908× and 24.27× higher energy efficiency over Intel Xeon Gold 6348 CPU and NVIDIA A100 GPU, respectively. |
|---|---|
| DOI: | 10.1109/DAC63849.2025.11132895 |
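
The summary names two ideas that a small software analogy may make more concrete: merging a broadcast multiplication with the summation that follows it, and skipping regions of a structured sparse operand that are known to be zero. The sketch below is only an illustration in plain Python, not the paper's hardware descriptors or its tensor-reorder-and-group algorithm. Every function name is hypothetical, and the lower-triangular (causal) sparsity pattern is an assumption chosen for the example.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def mul_then_sum(a: Matrix, b: List[float]) -> List[float]:
    """Unfused: broadcast-multiply each row of `a` by `b`, store it, then sum."""
    # The whole product matrix is materialized before any reduction happens.
    product = [[x * w for x, w in zip(row, b)] for row in a]
    return [sum(row) for row in product]


def fused_mul_sum(a: Matrix, b: List[float]) -> List[float]:
    """Fused: accumulate each product as it is formed; no intermediate matrix."""
    out = []
    for row in a:
        acc = 0.0
        for x, w in zip(row, b):
            acc += x * w
        out.append(acc)
    return out


def causal_block_matmul(l: Matrix, x: Matrix, block: int) -> Tuple[Matrix, int]:
    """Compute l @ x for a lower-triangular `l`, skipping all-zero blocks.

    Returns the result and the number of blocks actually visited.
    """
    n, d = len(l), len(x[0])
    y = [[0.0] * d for _ in range(n)]
    visited = 0
    for bi in range(0, n, block):
        for bj in range(0, n, block):
            if bj > bi:
                # Block lies strictly above the diagonal, so it is all zeros.
                continue
            visited += 1
            for i in range(bi, min(bi + block, n)):
                for j in range(bj, min(bj + block, n)):
                    for k in range(d):
                        y[i][k] += l[i][j] * x[j][k]
    return y, visited
```

In this toy setting the fused version returns the same values as the unfused one while never storing the intermediate product, and the blocked multiply touches only the blocks on or below the diagonal. Whether these savings carry over to a given accelerator depends on details the summary does not give.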