Adaptive scheduling of inference pipelines on multicore architectures
Saved in:
| Title: | Adaptive scheduling of inference pipelines on multicore architectures |
|---|---|
| Authors: | Soomro, Pirah Noor, 1993 |
| Keywords: | Online tuning, CNN parallel pipelines, Design space exploration, Interference mitigation, Heterogeneous computing units, Processing on chiplets, Inference serving systems |
| Description: | In today’s data-driven world, machine learning (ML) algorithms, particularly Convolutional Neural Networks (CNNs), power a wide range of applications across many domains. As demand for real-time inference grows, optimizing CNN inference across diverse computational platforms becomes imperative. This thesis addresses that challenge by examining heterogeneous edge devices, chiplet-based architectures, and inference-serving systems. Heterogeneous edge devices pose unique challenges due to resource constraints and architectural diversity, while chiplet-based architectures offer potential improvements in inference performance. Building on online tuning algorithms, malleable and moldable inference pipelines, and adaptive scheduling strategies, the thesis proposes a comprehensive framework for optimizing DNN inference that improves system performance, reduces latency, and mitigates interference effects, contributing to more efficient and scalable AI systems. The thesis addresses four key problems: enabling runtime scheduling of inference pipelines on edge devices, fully online scheduling of inference pipelines on heterogeneous platforms, mitigating interference effects on inference pipelines in inference-serving systems, and optimizing resource allocation for adaptive SLO-aware inference serving. Its contributions are presented in four papers, each addressing a distinct aspect of CNN inference optimization: a comprehensive framework for online scheduling of CNN pipelines, a method that leverages platform knowledge for expedited seed generation, dynamic scheduling techniques that alleviate interference effects, and SLO-aware scheduling techniques for resource allocation in inference-serving systems. Together, these contributions advance the state of the art in CNN inference optimization and inference-serving systems, paving the way for efficient and scalable AI systems that meet the demands of real-time inference across diverse computational platforms. |
| File description: | electronic |
| Access URL: | https://research.chalmers.se/publication/547635 https://research.chalmers.se/publication/547635/file/547635_Fulltext.pdf |
| Database: | SwePub |