Optimizing depthwise separable convolution on DCU

Detailed Bibliography
Title: Optimizing depthwise separable convolution on DCU
Authors: Zheng Liu, Meng Hao, Weizhe Zhang, Gangzhao Lu, Xueyang Tian, Siyu Yang, Mingdong Xie, Jie Dai, Chenyu Yuan, Desheng Wang, Hongwei Yang
Source: CCF Transactions on High Performance Computing. 6:646-664
Publisher Information: Springer Science and Business Media LLC, 2024.
Publication Year: 2024
Subjects: 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, 02 engineering and technology, 01 natural sciences
Description: The integration of Large Language Models (LLMs) with Convolutional Neural Networks (CNNs) is significantly advancing the development of large models. However, the computational cost of large models is high, necessitating optimization for greater efficiency. One effective way to optimize CNNs is the use of depthwise separable convolution (DSC), which decouples spatial and channel convolutions to reduce the number of parameters and enhance efficiency. In this study, we focus on porting and optimizing DSC kernel functions from the GPU to the Deep Computing Unit (DCU), a computing accelerator developed in China. For depthwise convolution, we implement a row data reuse algorithm to minimize redundant data loading and memory access overhead. For pointwise convolution, we extend our dynamic tiling strategy to improve hardware utilization by balancing resource allocation among blocks and threads, and we enhance arithmetic intensity through a channel distribution algorithm. We implement depthwise and pointwise convolution kernel functions and integrate them into PyTorch as extension modules. Experiments demonstrate that our optimized kernel functions outperform the MIOpen library on the DCU, achieving up to a 3.59× speedup in depthwise convolution and up to a 3.54× speedup in pointwise convolution. These results highlight the effectiveness of our approach in leveraging the DCU's architecture to accelerate deep learning operations.
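As a rough illustration of the parameter reduction the abstract attributes to DSC, the following sketch counts weights in a standard K×K convolution versus its depthwise + pointwise factorization. This is a generic illustration with hypothetical shapes and helper names, not the paper's DCU kernels:

```python
# Depthwise separable convolution (DSC) splits a standard KxK convolution
# into a depthwise (spatial, per-channel) pass and a pointwise (1x1,
# channel-mixing) pass, cutting the weight count substantially.

def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard KxK convolution (bias ignored)."""
    return k * k * c_in * c_out

def dsc_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise pass (k*k*c_in) plus a pointwise
    1x1 pass (c_in*c_out), bias ignored."""
    return k * k * c_in + c_in * c_out

# Illustrative shapes (not taken from the paper):
k, c_in, c_out = 3, 128, 256
standard = conv_params(k, c_in, c_out)   # 3*3*128*256 = 294912
separable = dsc_params(k, c_in, c_out)   # 3*3*128 + 128*256 = 33920
print(standard, separable, round(standard / separable, 2))  # 294912 33920 8.69
```

For large channel counts the ratio approaches k², which is why DSC is a common target for kernel-level optimization such as the depthwise and pointwise kernels the paper ports to the DCU.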
Document Type: Article
Language: English
ISSN: 2524-4930, 2524-4922
DOI: 10.1007/s42514-024-00200-3
Rights: CC BY
Accession Number: edsair.doi...........4ffb868df6bec042db727f0dd3e699b9
Database: OpenAIRE