Distributed Model Training Task Migration for Hotspot Management in Intelligent Computing Center Interconnection with Tidal Characteristics
| Title: | Distributed Model Training Task Migration for Hotspot Management in Intelligent Computing Center Interconnection with Tidal Characteristics |
|---|---|
| Authors: | Fan, Yingbo, Li, Yajie, Natalino Da Silva, Carlos, 1987, Wang, Yahui, Guo, Jiaxing, Wu, Wanping, Ruan, Rongrong, Wang, Wei, Zhao, Yongli, Zhang, Jie |
| Source: | IEEE Transactions on Network and Service Management. In Press |
| Subject Terms: | hotspot management, tidal effect, distributed model training, task migration, intelligent computing center interconnections |
| Description: | An intelligent computing center (ICC) is a new type of data center constructed with intelligent computing power, such as graphics processing units (GPUs) and artificial intelligence acceleration cards. The emergence of large models with billions of parameters (e.g., ChatGPT) creates a significant demand for computing power. It may be challenging for a single ICC to provide the required computing power during large model training. Thus, ICC interconnections (ICCI) will become a typical and effective solution to provide intensive computing power. Due to human activities, traditional computing tasks (e.g., transaction processing and online entertainment) exhibit a tidal effect in computing demand, which leads to tidal variation in the remaining computing resources. Moreover, distributed model training (DMT) tasks are likely to span both peaks and valleys of the tidal effect in computing power. In this case, DMT tasks can easily cause an ICC to become a hotspot (i.e., the computing load in an ICC exceeds a desired threshold), which significantly degrades the reliability and performance of the ICC. This paper proposes DeepHM, a deep reinforcement learning-based hotspot management strategy through task migration in ICCI networks. To comprehensively consider the bandwidth metrics of the ICCI network, we further propose a dynamic wavelength allocation strategy, i.e., DeepHM-DWA. Simulation results show that DeepHM and DeepHM-DWA reduce the hotspot compute unit time blocks by 19% and 18%, respectively, with fewer migrated workers while balancing the computing load among multiple ICCs. DeepHM and DeepHM-DWA also reduce the average completion time ratio of the DMT tasks by 2% and 5%, respectively. |
| File Description: | electronic |
| Access URL: | https://research.chalmers.se/publication/547622 https://research.chalmers.se/publication/547622/file/547622_Fulltext.pdf |
| Database: | SwePub |
| ISSN: | 1932-4537 |
| DOI: | 10.1109/TNSM.2025.3590011 |