FLAG: An FPGA-Based System for Low-Latency GNN Inference Service Using Vector Quantization
| Published in: | 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7 |
|---|---|
| Main authors: | , , , , |
| Format: | Conference proceedings |
| Language: | English |
| Published: | IEEE, 22.06.2025 |
| Subjects: | |
| Online access: | Full text |
| Abstract: | Enabling real-time GNN inference services requires low end-to-end latency to meet service-level agreements. However, intensive preparation steps and the neighborhood-explosion problem pose significant challenges to efficient GNN inference serving. In this paper, we propose FLAG, an FPGA-based GNN inference serving system using vector quantization. To reduce preparation overhead, we introduce offline preprocessing to precompute and compress hidden embeddings for serving. A dedicated FPGA accelerator leverages the precomputed data to enable lightweight aggregation. As a result, FLAG achieves average speedups of 154×, 176×, and 333× on three GNN models compared to the baseline system. |
| DOI: | 10.1109/DAC63849.2025.11132631 |
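The abstract's core idea, compressing precomputed hidden embeddings with vector quantization so that serving only needs a small codebook plus per-node codes, can be illustrated with a minimal sketch. This is not the paper's implementation: the codebook construction here is plain k-means, and the embedding sizes and codebook width are made-up placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_codebook(emb, k=16, iters=20):
    """Learn a k-entry codebook over embedding rows with naive k-means.

    Stand-in for whatever quantizer FLAG actually trains offline.
    """
    centers = emb[rng.choice(len(emb), k, replace=False)]
    assign = np.zeros(len(emb), dtype=np.int64)
    for _ in range(iters):
        # Assign each embedding to its nearest codebook entry.
        d = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        # Move each entry to the mean of its assigned embeddings.
        for j in range(k):
            pts = emb[assign == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers, assign

# Hypothetical precomputed hidden embeddings: 1000 nodes, 64-dim float32.
emb = rng.normal(size=(1000, 64)).astype(np.float32)
codebook, codes = build_codebook(emb, k=16)

# Serving stores only the codebook plus one small code per node,
# and aggregation can look embeddings up by code instead of recomputing them.
compressed_bytes = codebook.nbytes + codes.astype(np.uint8).nbytes
print(emb.nbytes, compressed_bytes)
```

With these placeholder sizes the stored footprint drops from 256 KB of raw embeddings to a 4 KB codebook plus 1 KB of codes, which is the kind of memory-traffic reduction that makes lightweight on-accelerator aggregation plausible.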