Real-time optical flow processing on embedded GPU: an hardware-aware algorithm to implementation strategy


Bibliographic details
Published in: Journal of real-time image processing, Vol. 19, No. 2, pp. 317-329
Main authors: Seznec, Mickaël; Gac, Nicolas; Orieux, François; Naik, Alvin Sashala
Format: Journal Article
Language: English
Published: Berlin/Heidelberg: Springer Berlin Heidelberg, 01.04.2022
Springer Nature B.V
Springer Verlag
ISSN:1861-8200, 1861-8219
Description
Summary: Determining the optical flow of a video is a compute-intensive task essential for computer vision. For achieving this processing in real time, the whole algorithm deployment chain must be thought of for efficiency first. The development is usually divided into two parts: first, designing an algorithm that meets precision constraints, then, implementing and optimizing its execution on the targeted platform. We argue that unifying those operations enhances performance on the embedded processor. This paper is based on an industrial use case of computer vision. The objective is to determine dense optical flow in real time on an embedded GPU platform: the Nvidia AGX Xavier. The CLG (combined local–global) optical flow method, initially chosen, is analyzed to understand the convergence speed of its underlying optimization problem. The Jacobi solver is selected for implementation because of its parallel nature. The whole multi-level processing is then ported to the GPU, using several specific optimization strategies. In particular, we analyze the impact of fusing the solver’s iterations with the roofline model. As a result, with a 30 W power budget, our implementation runs at 60 FPS, on 640 × 512 images, with a four-level processing. Hopefully, this example should provide feedback on the issues that arise when trying to port a method to a parallel platform and serve for further implementations of computer vision algorithms on specialized hardware.
DOI:10.1007/s11554-021-01187-8