APapo: An asynchronous parallel optimization method for DNN models


Detailed bibliography
Published in: Future Generation Computer Systems, Vol. 152, pp. 317–330
Main authors: Liu, Shuai; Ju, Tao
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.03.2024
Subjects: Model segmentation; Augmented antichain; Computation–communication overlap; DNN model parallelism; Asynchronous pipeline parallelism
ISSN: 0167-739X, 1872-7115
Abstract
To address the challenges of segmentation complexity, high memory usage, long training times, and low device utilization in the parallel optimization of large-scale deep neural network (DNN) models, this paper proposes an asynchronous parallel optimization method, APapo. First, a multi-iteration asynchronous pipeline-parallel schedule is established for model-parallel computing tasks, controlling the scheduling of micro-batch units to address delayed gradient updates during asynchronous iterations. Second, given the network model and hardware configuration, a dynamic programming strategy for computing resources and model tasks is designed to achieve dynamic partitioning of the model's computing tasks and optimal matching to computing resources. Finally, a runtime scheduling optimization strategy for computing resources and model tasks is developed, using improved device streams to maximize the overlap between computation and communication, thereby raising the utilization of computing resources and reducing training time. Experimental results show that APapo achieves fine-grained task partitioning, maximizes the utilization of each GPU, and improves the training speed of large-scale DNN models by 2.8 times on average while maintaining training accuracy, compared with existing parallel optimization methods.
• We propose an improved optimization strategy for pipeline model-parallel task scheduling: a multi-iteration asynchronous task management mechanism suited to large-scale model computing tasks and an overall execution framework for computing resources and model tasks that handles model partitioning and device allocation, and that resolves delayed gradient updates during asynchronous iterations by controlling the micro-batch scheduling process.
• We propose a model partitioning method based on augmented antichains. By transforming the computational tasks of a large-scale DNN model, we construct an antichain directed acyclic graph (DAG) state sequence that conforms to the computational iteration specification; combined with the characteristics of the hardware computing resources, tasks are then partitioned via dynamic programming to match computing tasks to computing resources (illustrated by the first sketch below).
• We design a runtime scheduling strategy for computing resources and tasks. By optimizing the devices' default streams, dependencies between computation nodes and communication are removed as far as possible, the overlap between computation and communication is maximized, and the utilization of computing resources is improved, increasing the training speed of large-scale DNN models while preserving training accuracy (illustrated by the second sketch below).
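The dynamic-programming partitioning step described above can be illustrated with a small sketch. This is not the paper's augmented-antichain algorithm; it is a minimal example, assuming a linear chain of hypothetical per-layer costs, that splits the chain into contiguous pipeline stages so that the most heavily loaded GPU (the pipeline bottleneck) is as light as possible. The layer costs, GPU count, and function name partition_layers are illustrative only.

```python
# Minimal sketch (not the paper's augmented-antichain method): split a linear
# chain of per-layer compute costs into `num_gpus` contiguous pipeline stages
# so that the heaviest stage, i.e. the pipeline bottleneck, is minimized.
from functools import lru_cache

def partition_layers(costs, num_gpus):
    """Return (bottleneck_cost, stages), each stage a list of layer indices."""
    n = len(costs)
    prefix = [0] * (n + 1)
    for i, c in enumerate(costs):
        prefix[i + 1] = prefix[i] + c

    @lru_cache(maxsize=None)
    def best(i, k):
        # Cheapest achievable bottleneck when layers i..n-1 are spread over
        # k stages, plus the cut positions that achieve it.
        if k == 1:
            return prefix[n] - prefix[i], (n,)
        best_cost, best_cuts = float("inf"), ()
        for j in range(i + 1, n - k + 2):       # first stage = layers i..j-1
            stage_cost = prefix[j] - prefix[i]
            rest_cost, rest_cuts = best(j, k - 1)
            bottleneck = max(stage_cost, rest_cost)
            if bottleneck < best_cost:
                best_cost, best_cuts = bottleneck, (j,) + rest_cuts
        return best_cost, best_cuts

    bottleneck, cuts = best(0, num_gpus)
    stages, start = [], 0
    for cut in cuts:
        stages.append(list(range(start, cut)))
        start = cut
    return bottleneck, stages

if __name__ == "__main__":
    # Hypothetical per-layer forward+backward costs (ms) on a 4-GPU pipeline.
    print(partition_layers([4, 9, 7, 3, 8, 2, 6, 5], 4))
```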
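The runtime step of overlapping computation and communication via device streams can likewise be sketched with PyTorch's CUDA stream API. This is not the APapo runtime scheduler; it is a minimal overlap pattern, assuming two CUDA GPUs, in which a device-to-device transfer is issued on a side stream while the default stream continues with the next computation. The tensor sizes, the destination device, and the helper name compute_and_communicate are illustrative.

```python
# Illustrative overlap pattern with PyTorch CUDA streams (not the APapo
# scheduler itself): launch a device-to-device copy on a side stream while the
# default stream keeps computing. Assumes at least two CUDA GPUs.
import torch

def compute_and_communicate(x, weight, comm_stream, dst_device="cuda:1"):
    # Step 1: produce a tensor on the default stream of cuda:0.
    y = x @ weight

    # Step 2: hand the finished tensor to the communication stream. The side
    # stream must wait on the default stream so it never reads partial data.
    comm_stream.wait_stream(torch.cuda.current_stream(x.device))
    with torch.cuda.stream(comm_stream):
        y_remote = y.to(dst_device, non_blocking=True)
        # Tell the caching allocator that y is still in use on comm_stream.
        y.record_stream(comm_stream)

    # Step 3: the default stream immediately starts the next computation while
    # the copy above proceeds concurrently on the side stream.
    z = torch.relu(x @ weight)
    return z, y_remote

if __name__ == "__main__":
    assert torch.cuda.device_count() >= 2, "sketch assumes two GPUs"
    comm_stream = torch.cuda.Stream(device="cuda:0")
    x = torch.randn(4096, 4096, device="cuda:0")
    w = torch.randn(4096, 4096, device="cuda:0")
    z, y_remote = compute_and_communicate(x, w, comm_stream)
    torch.cuda.synchronize()
```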
Author Liu, Shuai
Ju, Tao
ContentType Journal Article
Copyright 2023 Elsevier B.V.
DOI 10.1016/j.future.2023.11.004
Discipline Computer Science
EISSN 1872-7115
EndPage 330
ExternalDocumentID 10_1016_j_future_2023_11_004
S0167739X23004041
ISSN 0167-739X
IsPeerReviewed true
IsScholarly true
Keywords Model segmentation
Augmented antichain
Computation–communication overlap
DNN model parallelism
Asynchronous pipeline parallelism
Language English
ORCID 0000-0002-5850-4565
0009-0006-5398-7955
PageCount 14
PublicationCentury 2000
PublicationDate March 2024
2024-03-00
PublicationDateYYYYMMDD 2024-03-01
PublicationDecade 2020
PublicationTitle Future generation computer systems
PublicationYear 2024
Publisher Elsevier B.V
StartPage 317
SubjectTerms Asynchronous pipeline parallelism
Augmented antichain
Computation–communication overlap
DNN model parallelism
Model segmentation
Title APapo: An asynchronous parallel optimization method for DNN models
URI https://dx.doi.org/10.1016/j.future.2023.11.004
Volume 152