PARO: Hardware-Software Co-design with Pattern-aware Reorder-based Attention Quantization in Video Generation Models
| Published in: | 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7 |
|---|---|
| Main authors: | , , , , , , , , |
| Format: | Conference paper |
| Language: | English |
| Publication details: | IEEE, 22 June 2025 |
| Summary: | Transformer-based video generation models have demonstrated significant potential in content creation. However, current state-of-the-art models employing "3D full attention" encounter substantial computation and storage challenges. For instance, the attention map of CogVideoX-5B requires 56.50 GB, and generating a 49-frame video takes approximately 1 minute on an NVIDIA A100 GPU under FP16. Although model quantization has proven effective in reducing both memory and computational costs, applying it to video generation models remains challenging, because it must preserve algorithm performance while ensuring efficient hardware processing. To address these issues, we introduce PARO, a video generation accelerator with pattern-aware reorder-based attention quantization. PARO investigates the diverse attention patterns of 3D full attention and proposes a novel reorder technique that transforms these patterns into a unified "block diagonal" structure. Block-wise mixed-precision quantization is then applied to achieve lossless compression at an average bitwidth of 4.80 bits. On the hardware side, existing mixed-precision computing units cannot fully exploit the attention map bitwidth to accelerate QK multiplication. To overcome this limitation, PARO designs an output-bitwidth-aware mixed-precision processing element (PE) array through hardware-software co-design. This approach ensures that the mixed-precision characteristics are fully utilized to enhance hardware efficiency in the bottleneck attention computation. Experiments demonstrate that PARO delivers up to 2.71× improvement in end-to-end performance compared to an NVIDIA A100 GPU and achieves up to 6.38-7.05× speedup over state-of-the-art ASIC-based accelerators on the CogVideoX-2B and 5B models. |
| DOI: | 10.1109/DAC63849.2025.11133383 |
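
The summary describes reordering attention patterns into a "block diagonal" structure and then quantizing block by block. The sketch below is a toy illustration of that idea only, not the authors' implementation. It assumes a hypothetical attention pattern in which each token attends to the same spatial position in every frame, assumes the reorder is a switch from frame-major (F, H, W) token order to (H, W, F) order, and uses a made-up bit allocation (more bits on diagonal blocks, fewer elsewhere). All function names are invented for this example.

```python
from itertools import product


def fhw_index(f, h, w, H, W):
    """Flat token index in frame-major (F, H, W) order (assumed default layout)."""
    return (f * H + h) * W + w


def temporal_pattern(F, H, W):
    """Toy 0/1 mask: a token attends to the same (h, w) position in all frames."""
    n = F * H * W
    mask = [[0] * n for _ in range(n)]
    coords = list(product(range(F), range(H), range(W)))
    for f1, h1, w1 in coords:
        for f2, h2, w2 in coords:
            if (h1, w1) == (h2, w2):
                i = fhw_index(f1, h1, w1, H, W)
                j = fhw_index(f2, h2, w2, H, W)
                mask[i][j] = 1
    return mask


def hwf_permutation(F, H, W):
    """perm[new_position] = old_index, listing tokens in (H, W, F) order."""
    return [
        fhw_index(f, h, w, H, W)
        for h, w, f in product(range(H), range(W), range(F))
    ]


def reorder(mask, perm):
    """Apply the same token permutation to rows and columns of the mask."""
    return [[mask[i][j] for j in perm] for i in perm]


def is_dense_block_diagonal(mask, block):
    """True if the mask is 1 inside diagonal blocks of size `block`, 0 elsewhere."""
    n = len(mask)
    return all(
        mask[i][j] == (1 if i // block == j // block else 0)
        for i in range(n)
        for j in range(n)
    )


def average_bitwidth(num_blocks, diag_bits, off_bits):
    """Mean bits per entry on a num_blocks x num_blocks grid of equal blocks.

    Hypothetical allocation: `diag_bits` on diagonal blocks, `off_bits` elsewhere.
    """
    total = num_blocks * num_blocks
    off_count = total - num_blocks
    return (num_blocks * diag_bits + off_count * off_bits) / total


if __name__ == "__main__":
    F, H, W = 3, 2, 2
    mask = temporal_pattern(F, H, W)
    reordered = reorder(mask, hwf_permutation(F, H, W))
    print(is_dense_block_diagonal(mask, F))       # striped in (F, H, W) order
    print(is_dense_block_diagonal(reordered, F))  # block diagonal after reorder
    print(average_bitwidth(4, 8, 4))
```

In this toy setting the mask is a set of strided diagonals in frame-major order and becomes block diagonal, with blocks of size F, once tokens are listed in (H, W, F) order. The bit allocation is purely illustrative and is not meant to reproduce the 4.80-bit average reported in the summary.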