FLAG: An FPGA-Based System for Low-Latency GNN Inference Service Using Vector Quantization


Bibliographic Details
Published in: 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main Authors: Han, Yunki; Kim, Taehwan; Kim, Jiwan; Ha, Seohye; Kim, Lee-Sup
Format: Conference Paper
Language: English
Published: IEEE, June 22, 2025
Online Access: Full Text
Description

Summary: Enabling real-time GNN inference services requires low end-to-end latency to meet service level agreements. However, intensive preparation steps and the neighborhood explosion problem pose significant challenges to efficient GNN inference serving. In this paper, we propose FLAG, an FPGA-based GNN inference serving system using vector quantization. To reduce preparation overhead, we introduce offline preprocessing to precompute and compress hidden embeddings for serving. A dedicated FPGA accelerator leverages the precomputed data to enable lightweight aggregation. As a result, FLAG achieves average speedups of 154×, 176×, and 333× on three GNN models compared to the baseline system.
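The abstract describes precomputing hidden embeddings offline, compressing them with vector quantization, and then aggregating neighborhoods from the compressed data at serving time. The sketch below illustrates that general idea only; it is not FLAG's actual implementation. The codebook size, the k-means quantizer, and the mean aggregator are all illustrative assumptions, not details from the paper.

```python
import numpy as np

def build_codebook(embeddings, k=4, iters=10, seed=0):
    """Naive k-means vector quantizer: learn k centroids (the codebook)
    and assign each embedding a small integer code (illustrative only)."""
    rng = np.random.default_rng(seed)
    centroids = embeddings[rng.choice(len(embeddings), k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest centroid.
        dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=2)
        codes = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned embeddings.
        for c in range(k):
            members = embeddings[codes == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids, codes

def aggregate(neighbor_ids, codes, centroids):
    """Mean-aggregate a node's neighborhood using only codebook lookups,
    so full-precision embeddings are never touched at serving time."""
    return centroids[codes[neighbor_ids]].mean(axis=0)

# Toy graph: 8 nodes with 4-dimensional precomputed hidden embeddings.
emb = np.random.default_rng(1).normal(size=(8, 4))
codebook, codes = build_codebook(emb, k=2)
agg = aggregate(np.array([0, 3, 5]), codes, codebook)  # shape (4,)
```

Storing one small code per node instead of a full embedding is what makes the serving-time aggregation lightweight: the aggregator only gathers and averages codebook rows.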
DOI:10.1109/DAC63849.2025.11132631