Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training
Saved in:

| Published in: | arXiv.org |
|---|---|
| Main authors: | Shen-Yi Zhao, Chang-Wei Shi, Yin-Peng Xie, Wu-Jun Li |
| Format: | Paper |
| Language: | English |
| Published: | Ithaca: Cornell University Library, arXiv.org, 15.04.2024 |
| Subjects: | Computation; Machine learning; Momentum; Optimization; Training |
| ISSN: | 2331-8422 |
| Online access: | Full text |
| Abstract | Stochastic gradient descent (SGD) and its variants have been the dominating optimization methods in machine learning. Compared to SGD with small-batch training, SGD with large-batch training can better utilize the computational power of current multi-core systems such as graphics processing units (GPUs) and can reduce the number of communication rounds in distributed training settings. Thus, SGD with large-batch training has attracted considerable attention. However, existing empirical results showed that large-batch training typically leads to a drop in generalization accuracy. Hence, how to guarantee the generalization ability in large-batch training becomes a challenging task. In this paper, we propose a simple yet effective method, called stochastic normalized gradient descent with momentum (SNGM), for large-batch training. We prove that with the same number of gradient computations, SNGM can adopt a larger batch size than momentum SGD (MSGD), which is one of the most widely used variants of SGD, to converge to an \(\epsilon\)-stationary point. Empirical results on deep learning verify that when adopting the same large batch size, SNGM can achieve better test accuracy than MSGD and other state-of-the-art large-batch training methods. |
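The abstract names the method but does not state its update rule. As a rough illustration of what a "normalized gradient descent with momentum" step can look like, the sketch below applies momentum to stochastic mini-batch gradients and then normalizes the resulting direction before taking a fixed-length step. All names (`sngm_step`, the toy objective) and the exact placement of the normalization (over the momentum buffer rather than the raw gradient) are assumptions made for illustration; consult the paper itself for the authors' actual algorithm and its convergence guarantees.

```python
import numpy as np

def sngm_step(w, u, grad, lr=0.05, beta=0.9, eps=1e-8):
    """One illustrative normalized-momentum update (assumed form, not the paper's exact rule).

    w    : current parameters
    u    : momentum buffer (same shape as w)
    grad : stochastic gradient computed on a (large) mini-batch
    """
    u = beta * u + grad                          # accumulate momentum over mini-batch gradients
    w = w - lr * u / (np.linalg.norm(u) + eps)   # step of (almost) fixed length lr along the momentum direction
    return w, u

# Toy usage: minimize f(w) = ||w||^2 / 2 with noisy "large-batch" gradients.
rng = np.random.default_rng(0)
w = rng.normal(size=10)
u = np.zeros_like(w)
batch_size = 256                                 # large batch: averaging shrinks the gradient noise
for _ in range(300):
    noise = rng.normal(size=(batch_size, 10)).mean(axis=0)
    grad = w + noise                             # gradient of the toy objective plus mini-batch noise
    w, u = sngm_step(w, u, grad)
print(np.linalg.norm(w))                         # shrinks to roughly the step length lr
```

Because each update direction is normalized, the size of every move is governed by the step size lr rather than by the gradient magnitude, which is one heuristic reading of why such updates may tolerate larger batch sizes than plain momentum SGD; this intuition is inferred from the abstract, not a claim of the paper's analysis.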
| Author | Shen-Yi Zhao; Chang-Wei Shi; Yin-Peng Xie; Wu-Jun Li |
| ContentType | Paper |
| Copyright | 2024. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
| DOI | 10.48550/arxiv.2007.13985 |
| Discipline | Physics |
| EISSN | 2331-8422 |
| Genre | Working Paper/Pre-Print |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English |
| OpenAccessLink | https://www.proquest.com/publiccontent/docview/2428261767?pq-origsite=%requestingapplication% |
| PublicationCentury | 2000 |
| PublicationDate | 20240415 |
| PublicationDateYYYYMMDD | 2024-04-15 |
| PublicationDecade | 2020 |
| PublicationPlace | Ithaca |
| PublicationTitle | arXiv.org |
| PublicationYear | 2024 |
| Publisher | Cornell University Library, arXiv.org |
| SecondaryResourceType | preprint |
| SourceID | proquest |
| SourceType | Aggregation Database |
| SubjectTerms | Computation Machine learning Momentum Optimization Training |
| Title | Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training |
| URI | https://www.proquest.com/docview/2428261767 |
| hasFullText | 1 |
| inHoldings | 1 |
| linkProvider | ProQuest |