Analysis of Model Parallelism for AI Applications on a 64-core RV64 Server CPU

Detailed Bibliography
Published in: International Journal of Parallel Programming, Volume 53, Issue 4, p. 27
Main authors: Malenza, Giulio; Garcia, Adriano Marques; Birke, Robert; Benini, Luca; Aldinucci, Marco
Format: Journal Article
Language: English
Published: New York: Springer US, 01.08.2025 (Springer Nature B.V.)
ISSN: 0885-7458, 1573-7640
Description
Summary: Massive data-parallel workloads, driven by inference on large ML models, are pushing hardware vendors to develop efficient and cost-effective multi-core server CPUs. The RISC-V architecture plays a prominent role due to its open, extensible, and energy-friendly ISA. Despite significant progress in recent years, finding efficient methods to run AI applications in parallel on new architectures and fully harness their performance remains a challenge. In this study, we investigate the impact of model parallelism on the inference of machine learning models on the SOPHON SG2042 SoC, the first server-grade CPU based on the RV64 ISA, composed of 64 cores arranged in a grid of 16 groups of 4 cores. Specifically, we aim to enhance performance via the better data locality stemming from splitting the model and assigning its parts to specific (groups of) cores, handling inter-part dependencies via pipeline execution. We orchestrate execution using FastFlow, a low-level programming framework designed for multithreaded streaming applications. By comparing the results against the standard multi-core inference approach based on data parallelism and analyzing the effects of different submodel-to-core mapping strategies, we aim to provide a comprehensive understanding of how the model-parallel approach can maximize efficiency and utilization of hardware resources. In our experiments, model parallelism improved performance by up to 8.4× over native PyTorch parallelism.
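
To make the pipeline idea concrete, the following is a minimal sketch of the kind of model-parallel execution the abstract describes, using FastFlow's typed-node API (ff_node_t, ff_Pipe): the model is split into two submodels, each wrapped in a pipeline stage pinned to a core in a different 4-core group. This is an illustrative sketch, not the authors' code: the Task payload, the stage bodies, the core IDs, and the use of plain Linux thread affinity for pinning are all assumptions made for the example.

```cpp
// Minimal sketch (illustrative, not the paper's code) of model-parallel
// inference as a FastFlow pipeline on a clustered many-core CPU.
// Assumptions: Task, the stage bodies, and core IDs 0/4 are placeholders;
// pinning uses Linux POSIX affinity, which may differ from the mapping
// mechanism actually used in the paper.
#include <ff/ff.hpp>
#include <pthread.h>
#include <sched.h>
#include <iostream>
#include <vector>

using namespace ff;

// Pin the calling thread to a given hardware core (Linux-specific).
static void pin_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

struct Task {
    std::vector<float> activations; // intermediate data passed between submodels
};

// Streams inference requests into the pipeline.
struct Source : ff_node_t<Task> {
    Task* svc(Task*) override {
        for (int i = 0; i < 8; ++i) ff_send_out(new Task{});
        return EOS; // end-of-stream: shuts the pipeline down
    }
};

// First submodel, pinned to a core in the first 4-core group.
struct SubmodelA : ff_node_t<Task> {
    int svc_init() override { pin_to_core(0); return 0; }
    Task* svc(Task* t) override {
        // ... run the first half of the model on t->activations ...
        return t; // forward activations to the next stage
    }
};

// Second submodel, pinned to a core in a different 4-core group
// (cores 4-7 on a 16x4 layout like the SG2042's).
struct SubmodelB : ff_node_t<Task> {
    int svc_init() override { pin_to_core(4); return 0; }
    Task* svc(Task* t) override {
        // ... run the second half of the model ...
        delete t;
        return GO_ON; // task consumed; keep the stage running
    }
};

int main() {
    Source src; SubmodelA a; SubmodelB b;
    ff_Pipe<> pipe(src, a, b);         // src -> submodel A -> submodel B
    if (pipe.run_and_wait_end() < 0) { // blocks until EOS has propagated
        error("running pipeline");
        return -1;
    }
    return 0;
}
```

The intent of pinning each stage to its own 4-core group, per the abstract, is data locality: a submodel's weights stay resident in that group's local cache instead of every core touching every layer, as happens in the data-parallel baseline. Building against a FastFlow checkout would look something like g++ -std=c++17 -O3 -pthread -I<FASTFLOW_ROOT> pipe.cpp (paths are placeholders).
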
DOI: 10.1007/s10766-025-00802-6