APapo: An asynchronous parallel optimization method for DNN models
Saved in:
| Published in: | Future Generation Computer Systems, Volume 152, pp. 317–330 |
|---|---|
| Main authors: | Liu, Shuai; Ju, Tao |
| Medium: | Journal Article |
| Language: | English |
| Publication details: | Elsevier B.V., 01.03.2024 |
| Subject: | DNN model parallelism; Asynchronous pipeline parallelism; Model segmentation; Augmented antichain; Computation–communication overlap |
| ISSN: | 0167-739X, 1872-7115 |
| Online access: | Get full text: https://dx.doi.org/10.1016/j.future.2023.11.004 |
| Abstract | To address the challenges of segmentation complexity, high memory usage, long training time, and low device utilization in the parallel optimization of large-scale deep neural network (DNN) models, this paper proposes an asynchronous parallel optimization method, APapo. First, a multi-iteration asynchronous pipeline-parallel schedule is established for model-parallel computing tasks, controlling the scheduling of micro-batch units to address delayed gradient updates during asynchronous iteration. Second, given the network model and hardware configuration, a dynamic-programming strategy for computing resources and model tasks is designed to partition model computing tasks dynamically and match them optimally to computing resources. Finally, a runtime scheduling strategy for computing resources and model tasks is developed that uses improved device streams to maximize the overlap between computation and communication, improving the utilization of computing resources and reducing training time. Experimental results show that APapo achieves fine-grained task segmentation, maximizes the utilization of each GPU, and, compared with existing parallel optimization methods, improves the training speed of large-scale DNN models by 2.8 times on average while maintaining training accuracy.
• We propose an improved parallel task-scheduling strategy for pipeline model parallelism: a multi-iteration asynchronous task-management mechanism for large-scale model computing tasks and an overall execution framework for computing resources and model tasks that handles model partitioning and device allocation, and resolves delayed gradient updates during asynchronous iteration by controlling the micro-batch scheduling process.
• We propose a model segmentation method based on augmented antichains. The computational tasks of a large-scale DNN model are transformed into an antichain directed acyclic graph (DAG) state sequence that conforms to the computational iteration specification; on this basis, tasks are segmented by dynamic programming according to the characteristics of the hardware computing resources, achieving a reasonable match between computing tasks and computing resources.
• We design a runtime scheduling strategy for computing resources and tasks. By optimizing the default device stream, dependencies between computation and communication are eliminated to the greatest extent, their overlap is maximized, the utilization of computing resources is improved, and the training speed of large-scale DNN models is increased while training accuracy is preserved. |
|---|---|
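The abstract above outlines three mechanisms: an asynchronous micro-batch pipeline schedule, dynamic-programming partitioning of the model over an augmented-antichain state sequence, and stream-based overlap of computation and communication. The three sketches that follow illustrate each in turn under stated assumptions; none reproduces APapo itself. First, a minimal sketch of a generic asynchronous 1F1B-style micro-batch schedule (stage and micro-batch counts are illustrative), showing the kind of per-stage interleaving such a scheduler controls:

```python
# Minimal sketch (not APapo's scheduler): a generic asynchronous 1F1B-style
# pipeline schedule for `num_stages` stages and `num_microbatches` micro-batches.
# After a short warm-up of forward passes, each stage alternates one forward
# and one backward pass, so no stage waits for a full synchronous flush.

def async_pipeline_schedule(num_stages: int, num_microbatches: int):
    """Return, per stage, the ordered list of ('F'|'B', microbatch_id) steps."""
    schedule = []
    for stage in range(num_stages):
        # Later stages need fewer in-flight forwards before they can start
        # interleaving backwards (classic 1F1B warm-up rule).
        warmup = min(num_stages - stage - 1, num_microbatches)
        steps, fwd, bwd = [], 0, 0
        for _ in range(warmup):                # warm-up: forwards only
            steps.append(("F", fwd)); fwd += 1
        while bwd < num_microbatches:          # steady state: 1F1B interleave
            if fwd < num_microbatches:
                steps.append(("F", fwd)); fwd += 1
            steps.append(("B", bwd)); bwd += 1
        schedule.append(steps)
    return schedule

if __name__ == "__main__":
    for stage, steps in enumerate(async_pipeline_schedule(4, 8)):
        print(f"stage {stage}: " + " ".join(f"{op}{mb}" for op, mb in steps))
```

Printing the schedule for 4 stages and 8 micro-batches shows earlier stages doing longer forward warm-ups before settling into the one-forward-one-backward steady state that keeps every device busy; APapo's multi-iteration schedule additionally addresses delayed gradient updates across asynchronous iterations, which this sketch does not model.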
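Second, the partitioning step. The sketch below is not the paper's algorithm: it assumes the model DAG has already been linearized into an ordered sequence of computation blocks (the paper uses augmented-antichain state sequences for this) with known per-block compute costs (hypothetical values), and it only minimizes the bottleneck stage time, whereas APapo's cost model also accounts for communication and the hardware configuration.

```python
# Minimal sketch (illustrative only): dynamic programming that splits an
# ordered sequence of computation blocks into contiguous pipeline stages,
# one per device, minimizing the slowest (bottleneck) stage time.
from functools import lru_cache

def partition_min_bottleneck(block_costs, num_devices):
    """Split block_costs into num_devices contiguous groups minimizing the max group sum."""
    n = len(block_costs)
    prefix = [0.0]
    for c in block_costs:
        prefix.append(prefix[-1] + c)

    @lru_cache(maxsize=None)
    def best(i, k):
        # best(i, k): minimal bottleneck for blocks i..n-1 spread over k devices.
        if k == 1:
            return prefix[n] - prefix[i]
        result = float("inf")
        for j in range(i + 1, n - k + 2):      # first stage takes blocks i..j-1
            stage_time = prefix[j] - prefix[i]
            result = min(result, max(stage_time, best(j, k - 1)))
        return result

    return best(0, num_devices)

if __name__ == "__main__":
    costs = [4.0, 2.0, 7.0, 3.0, 5.0, 1.0, 6.0, 2.0]  # hypothetical block times (ms)
    print(partition_min_bottleneck(costs, num_devices=4))  # -> 9.0
```

Each state (i, k) tries every split point for its first stage and recurses on the suffix, giving O(n²·k) time for n blocks and k devices; a realistic cost model would replace the plain prefix sums with per-stage compute, memory, and inter-stage communication estimates.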
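Third, the compute–communication overlap. The record's references include PyTorch CUDA streams, so the sketch below uses that API, but it shows only the general side-stream pattern rather than APapo's runtime: the gradient all-reduce is stood in for by an asynchronous copy to pinned host memory so the example runs without a distributed process group, the names (`train_step_with_overlap`, `comm_stream`) are illustrative, and a CUDA device is required.

```python
import torch

def train_step_with_overlap(model_chunk, comm_stream, x):
    # Drop references to the previous step's grads; their memory stays alive
    # until the pending side-stream copies finish (see record_stream below).
    model_chunk.zero_grad(set_to_none=True)

    loss = model_chunk(x).sum()        # forward/backward on the default stream
    loss.backward()

    grads = [p.grad for p in model_chunk.parameters() if p.grad is not None]
    comm_stream.wait_stream(torch.cuda.current_stream())  # grads must be ready
    staged = []
    with torch.cuda.stream(comm_stream):
        for g in grads:
            # Stand-in for dist.all_reduce(g, async_op=True): an asynchronous
            # copy to pinned host memory issued on the side stream.
            buf = torch.empty(g.shape, dtype=g.dtype, pin_memory=True)
            buf.copy_(g, non_blocking=True)
            g.record_stream(comm_stream)  # don't recycle g's memory too early
            staged.append(buf)
    return loss, staged  # caller keeps `staged` alive until synchronization

if __name__ == "__main__" and torch.cuda.is_available():
    model = torch.nn.Linear(1024, 1024).cuda()
    comm_stream = torch.cuda.Stream()
    pending = []
    for _ in range(4):  # the next step's compute overlaps the pending copies
        x = torch.randn(64, 1024, device="cuda")
        pending.append(train_step_with_overlap(model, comm_stream, x))
    torch.cuda.synchronize()
```

The two ordering primitives carry the pattern: `comm_stream.wait_stream(...)` delays the transfer until the gradients are actually computed, and `record_stream` keeps the gradient buffers alive until the side-stream work finishes while the default stream moves on to the next micro-batch.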
| Author | Liu, Shuai; Ju, Tao |
| Author_xml | Liu, Shuai (ORCID: 0009-0006-5398-7955); Ju, Tao (ORCID: 0000-0002-5850-4565; email: jutao@mail.lzjtu.cn) |
| CitedBy_id | 10.1016/j.future.2024.107600 |
| ContentType | Journal Article |
| Copyright | 2023 Elsevier B.V. |
| DOI | 10.1016/j.future.2023.11.004 |
| DatabaseName | CrossRef |
| DatabaseTitle | CrossRef |
| Discipline | Computer Science |
| EISSN | 1872-7115 |
| EndPage | 330 |
| ISICitedReferencesCount | 1 |
| ISSN | 0167-739X |
| IsPeerReviewed | true |
| IsScholarly | true |
| Keywords | Model segmentation; Augmented antichain; Computation–communication overlap; DNN model parallelism; Asynchronous pipeline parallelism |
| Language | English |
| ORCID | 0000-0002-5850-4565 0009-0006-5398-7955 |
| PageCount | 14 |
| PublicationCentury | 2000 |
| PublicationDate | March 2024 |
| PublicationDateYYYYMMDD | 2024-03-01 |
| PublicationDecade | 2020 |
| PublicationTitle | Future generation computer systems |
| PublicationYear | 2024 |
| Publisher | Elsevier B.V |
| StartPage | 317 |
| SubjectTerms | Asynchronous pipeline parallelism; Augmented antichain; Computation–communication overlap; DNN model parallelism; Model segmentation |
| Title | APapo: An asynchronous parallel optimization method for DNN models |
| URI | https://dx.doi.org/10.1016/j.future.2023.11.004 |
| Volume | 152 |