Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training

Stochastic gradient descent (SGD) and its variants have been the dominant optimization methods in machine learning. Compared to SGD with small-batch training, SGD with large-batch training can better utilize the computational power of current multi-core systems such as graphics processing units (GPUs) and can reduce the number of communication rounds in distributed training settings. Thus, SGD with large-batch training has attracted considerable attention. However, existing empirical results show that large-batch training typically leads to a drop in generalization accuracy. Hence, how to guarantee the generalization ability in large-batch training becomes a challenging task. In this paper, we propose a simple yet effective method, called stochastic normalized gradient descent with momentum (SNGM), for large-batch training. We prove that with the same number of gradient computations, SNGM can adopt a larger batch size than momentum SGD (MSGD), which is one of the most widely used variants of SGD, to converge to an \(\epsilon\)-stationary point. Empirical results on deep learning verify that when adopting the same large batch size, SNGM can achieve better test accuracy than MSGD and other state-of-the-art large-batch training methods.
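
Below is a minimal Python sketch of a normalized-gradient-with-momentum update, assuming the mini-batch gradient is normalized before being folded into a momentum buffer. The function name sngm_step, the momentum coefficient beta, the learning rate lr, and the exact placement of the normalization are illustrative assumptions and may differ from the precise SNGM update analyzed in the paper.

import numpy as np

def sngm_step(w, u, grad, lr=0.05, beta=0.9, eps=1e-12):
    # Illustrative normalized-gradient momentum step (assumed form, not
    # necessarily the exact rule from the paper): normalize the stochastic
    # gradient, accumulate it into the momentum buffer, then move along it.
    g_unit = grad / (np.linalg.norm(grad) + eps)
    u = beta * u + g_unit
    w = w - lr * u
    return w, u

# Toy run on f(w) = 0.5 * ||w||^2, whose gradient is w itself. With a constant
# step size the iterates settle into a small neighborhood of the stationary
# point at the origin, roughly on the order of lr / (1 - beta) in radius.
w, u = np.array([3.0, -4.0]), np.zeros(2)
for _ in range(200):
    w, u = sngm_step(w, u, grad=w)
print(np.linalg.norm(w))

Normalizing the gradient keeps the per-step movement bounded regardless of the gradient's magnitude, which is the intuition usually given for normalized-gradient methods.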

Bibliographic Details
Published in: arXiv.org
Main authors: Shen-Yi Zhao, Chang-Wei Shi, Yin-Peng Xie, Wu-Jun Li
Format: Paper
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 15 April 2024
Subjects: Computation, Machine learning, Momentum, Optimization, Training
ISSN: 2331-8422
Online access: Full text
ContentType Paper
Copyright 2024. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
DOI 10.48550/arxiv.2007.13985
Discipline Physics
EISSN 2331-8422
Genre Working Paper/Pre-Print
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
Language English
SubjectTerms Computation
Machine learning
Momentum
Optimization
Training
Title Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training
URI https://www.proquest.com/docview/2428261767