Examining the Viability of Row-Scale Disaggregation for Production Applications

Bibliographic Details
Published in: SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1613-1621
Main Authors: Shorts, Curtis; Grant, Ryan E.
Format: Conference Proceeding
Language: English
Published: IEEE 17.11.2024
Description
Summary: Row-scale Composable Disaggregated Infrastructure (CDI) is a heterogeneous high performance computing (HPC) architecture that relocates GPUs to a single chassis from which CPU nodes can then request compute resources. This is a distinctly different architecture from rack-scale CDI, as the GPUs are accessed over a network rather than existing in the same PCIe domain as the CPUs. Row-scale CDI expands the benefits and flexibility of rack-scale CDI while introducing new challenges. For example, with row-scale CDI, one must account for the effects of "slack", the added latency in CPU-to-GPU communication caused by network delays. This work seeks to assess potential challenges with row-scale CDI to determine which factors are most important to consider when deploying a CDI system. Our strong scaling application analyses reveal that there are two types of HPC workloads that may benefit from row-scale CDI: those that are CPU dominant and periodically call on the GPU to do highly parallel tasks, and those that are GPU dominant and primarily rely on the CPU to coordinate work. We compare the kernel and data transfer characteristics of each application against those of a slack proxy application, which allowed us to develop a mathematical model that predicts the performance penalty different applications can face as a result of slack. To illustrate this, we profile two applications using our proposed method and find that, in the pessimistic case, they would see a performance penalty of less than 1% above the effects of crossing the network in an environment that induces 100 µs of slack, equivalent to a distance of 20 km at the speed of light in a fibre optic network cable. This demonstrates that both row-scale and cluster-scale CDI are viable technologies from an application performance perspective.
DOI: 10.1109/SCW63240.2024.00201
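
The summary equates 100 µs of slack with roughly 20 km of fibre. The short Python sketch below reproduces that arithmetic. It assumes a signal speed in fibre of about 2.0e8 m/s (roughly two-thirds of the vacuum speed of light, for a refractive index near 1.5) and treats slack as a one-way propagation delay. Both assumptions, along with the function and constant names, are illustrative and are not taken from the paper.

```python
# Assumed signal speed in optical fibre: about two-thirds of c in vacuum
# (refractive index ~1.5). This value is an assumption, not from the paper.
FIBRE_SPEED_M_PER_S = 2.0e8


def slack_to_fibre_distance_km(slack_us: float,
                               speed_m_per_s: float = FIBRE_SPEED_M_PER_S) -> float:
    """One-way distance (km) a signal covers in fibre during `slack_us` microseconds."""
    # microseconds * (m/s) = 1e-6 m; dividing by 1e9 converts to kilometres.
    return slack_us * speed_m_per_s / 1e9


def fibre_distance_to_slack_us(distance_km: float,
                               speed_m_per_s: float = FIBRE_SPEED_M_PER_S) -> float:
    """One-way propagation delay (µs) over `distance_km` of fibre."""
    # km * 1e3 = m; m / (m/s) = s; s * 1e6 = µs, so the net factor is 1e9.
    return distance_km * 1e9 / speed_m_per_s


if __name__ == "__main__":
    print(f"{slack_to_fibre_distance_km(100.0):.1f} km")   # distance for 100 µs of slack
    print(f"{fibre_distance_to_slack_us(20.0):.1f} µs")    # slack for 20 km of fibre
```

Under these assumptions, 100 µs of slack corresponds to about 20 km of fibre, which matches the figure quoted in the summary. A round-trip interpretation of slack would halve the distance.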