Celerity-RSim: Porting Light Propagation Simulation to Accelerator Clusters Using a High-Level API

Bibliographic Details
Published in: International Journal of Parallel Programming, Vol. 53, No. 3, p. 17
Main Authors: Thoman, Peter; Gschwandtner, Philipp; Molina Heredina, Facundo; Fahringer, Thomas
Format: Journal Article
Language: English
Published: New York: Springer US (Springer Nature B.V.), 01.06.2025
ISSN: 0885-7458, 1573-7640
Description
Summary: Time-of-Flight (ToF) camera systems are increasingly capable of analyzing larger 3D spaces and providing more detailed and precise results. To increase the speed-to-solution during development, testing, and validation of such systems, light propagation simulation is employed. One such simulation, RSim, was previously performed on single workstations; however, the increase in detail required for newer ToF hardware necessitates cluster-level parallelism in order to maintain an experiment latency that enables productive design work. Celerity is a high-level parallel API and runtime system for clusters of accelerators intended to simplify the development of domain science applications. It automatically manages data and work distribution, while also transparently overlapping asynchronous computation and communication. In this paper, we present a use case study of porting the full RSim application to GPU clusters using the Celerity system. In order to improve scalability, a new parallelization scheme was employed for the core simulation task, and Celerity was extended with a high-level split constraints feature which enables this scheme. We present strong- and weak-scaling experiments for the resulting application on three accelerator clusters with up to 128 GPUs, and also evaluate the relative programming effort required to distribute the application across multiple GPUs using different APIs.
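For readers unfamiliar with Celerity, the following is a minimal sketch of what a kernel submission looks like in its public SYCL-derived API, illustrating the automatic data and work distribution the abstract refers to. It is not taken from the RSim code base; the buffer size and kernel name are placeholders, the API names follow older Celerity releases (distr_queue; newer versions rename this to celerity::queue), and the split-constraints extension described in the paper is not shown.

// Minimal, hedged sketch of a Celerity kernel submission (not RSim code).
#include <celerity.h>

int main() {
    // The distributed queue transparently spans all GPUs of all cluster nodes;
    // the runtime decides how the kernel's iteration space is split among them.
    celerity::distr_queue queue;

    // Placeholder-sized buffer; Celerity tracks which parts live on which node.
    celerity::buffer<float, 1> data(celerity::range<1>(1024));

    queue.submit([=](celerity::handler& cgh) {
        // The range mapper (one_to_one) declares which buffer region each chunk
        // of the kernel accesses, enabling automatic data distribution and
        // asynchronous transfer/compute overlap without explicit MPI calls.
        celerity::accessor out{data, cgh, celerity::access::one_to_one{},
                               celerity::write_only, celerity::no_init};
        cgh.parallel_for<class init_kernel>(data.get_range(),
            [=](celerity::item<1> item) { out[item] = 1.0f; });
    });

    return 0;
}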
DOI: 10.1007/s10766-025-00787-2