MambaOPU: An FPGA Overlay Processor for State-space-duality-based Mamba Models

State-space models (SSMs), such as Mamba, have emerged as a promising alternative to Transformers. However, the recently developed Mamba2, based on state space duality (SSD), is highly memorybound and suffers from limited computation efficiency. This inefficiency arises from its irregular broadcast...

Full description

Saved in:
Bibliographic Details
Published in:2025 62nd ACM/IEEE Design Automation Conference (DAC) pp. 1 - 7
Main Authors: Lu, Shaoqiang, Yu, Xuliang, Zhao, Tiandong, Miao, Siyuan, Sheng, Xinsong, Wu, Chen, Zhao, Liang, Lin, Ting-Jung, He, Lei
Format: Conference Proceeding
Language:English
Published: IEEE 22.06.2025
Subjects:
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:State-space models (SSMs), such as Mamba, have emerged as a promising alternative to Transformers. However, the recently developed Mamba2, based on state space duality (SSD), is highly memorybound and suffers from limited computation efficiency. This inefficiency arises from its irregular broadcast element-wise multiplications and structured sparse computations. In this work, we propose MambaOPU, an FPGA overlay processor, to accelerate SSD. First, to reduce memory overhead, we introduce a software-hardware co-optimized operator fusion framework. Specifically, operator merging combines adjacent broadcast multiplication and summation operations into a single descriptor, while operator backward shifting embeds segment multiplication into subsequent operations. Both techniques shorten the computation path and improve computation efficiency. Second, to enhance sparse computation efficiency, we skip zero-region computations using a tensor-reorder-and-group algorithm combined with a sparse-predefined data fetcher. Additionally, since Mamba integrates linear operations with SSD, we develop a reconfigurable systolic array to improve data reuse across different computation modes. Extensive experiment results demonstrate that MambaOPU achieves up to 1812 \times and 880.79 \times higher normalized throughput and up to 12908 \times and 24.27 \times higher energy efficiency over Intel Xeon Gold 6348 CPU and NVIDIA A100 GPU, respectively.
DOI:10.1109/DAC63849.2025.11132895