Loss landscapes and optimization in over-parameterized non-linear systems and neural networks

The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. The purpose of this work is to propose a modern view and a general mathematical framework for loss landscapes and efficient optimization in...

Detailed description

Saved in:
Bibliographic details
Published in: Applied and Computational Harmonic Analysis, Vol. 59, pp. 85–116
Main Authors: Liu, Chaoyue; Zhu, Libin; Belkin, Mikhail
Format: Journal Article
Language: English
Published: Elsevier Inc., 01.07.2022
Keywords: Deep learning; PL⁎ condition; Over-parameterized models; Non-linear optimization
ISSN: 1063-5203
Online access: Full text
Abstract The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. The purpose of this work is to propose a modern view and a general mathematical framework for loss landscapes and efficient optimization in over-parameterized machine learning models and systems of non-linear equations, a setting that includes over-parameterized deep neural networks. Our starting observation is that optimization landscapes corresponding to such systems are generally not convex, even locally around a global minimum, a condition we call essential non-convexity. We argue that instead they satisfy PL⁎, a variant of the Polyak-Łojasiewicz condition [32,25], on most (but not all) of the parameter space, which guarantees both the existence of solutions and efficient optimization by (stochastic) gradient descent (SGD/GD). The PL⁎ condition of these systems is closely related to the condition number of the tangent kernel associated to a non-linear system, showing how a PL⁎-based non-linear theory parallels classical analyses of over-parameterized linear equations. We show that wide neural networks satisfy the PL⁎ condition, which explains the (S)GD convergence to a global minimum. Finally, we propose a relaxation of the PL⁎ condition applicable to “almost” over-parameterized systems.
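Background sketch. The following is a standard textbook formulation of the PL⁎ condition referenced in the abstract, of its link to the tangent kernel, and of the linear convergence of gradient descent it yields; the symbols μ, β, η and the factor-of-two conventions are generic choices made for this sketch and may differ in detail from the article's own definitions.

\[
  \text{A non-negative loss } \mathcal{L}(\mathbf{w}) \text{ satisfies } \mu\text{-PL}^{*} \text{ on a set } S
  \quad\Longleftrightarrow\quad
  \tfrac{1}{2}\,\big\|\nabla \mathcal{L}(\mathbf{w})\big\|^{2} \;\ge\; \mu\,\mathcal{L}(\mathbf{w})
  \quad \text{for all } \mathbf{w} \in S .
\]
For the square loss of a non-linear system $F(\mathbf{w}) = \mathbf{y}$ with tangent kernel
$K(\mathbf{w}) = DF(\mathbf{w})\,DF(\mathbf{w})^{\top}$,
\[
  \mathcal{L}(\mathbf{w}) = \tfrac{1}{2}\,\big\|F(\mathbf{w}) - \mathbf{y}\big\|^{2}
  \;\Longrightarrow\;
  \big\|\nabla \mathcal{L}(\mathbf{w})\big\|^{2}
  = \big(F(\mathbf{w}) - \mathbf{y}\big)^{\top} K(\mathbf{w}) \big(F(\mathbf{w}) - \mathbf{y}\big)
  \;\ge\; \lambda_{\min}\!\big(K(\mathbf{w})\big)\,\big\|F(\mathbf{w}) - \mathbf{y}\big\|^{2},
\]
so the square loss satisfies $\lambda_{\min}(K(\mathbf{w}))$-PL$^{*}$ wherever the tangent kernel is well conditioned.
If, in addition, $\mathcal{L}$ is $\beta$-smooth and the iterates remain in $S$, gradient descent
$\mathbf{w}_{t+1} = \mathbf{w}_{t} - \eta\,\nabla \mathcal{L}(\mathbf{w}_{t})$ with step size $\eta \le 1/\beta$ converges linearly:
\[
  \mathcal{L}(\mathbf{w}_{t}) \;\le\; (1 - \eta\,\mu)^{t}\, \mathcal{L}(\mathbf{w}_{0}).
\]

Roughly speaking, in this framing the abstract's claim that wide networks satisfy PL⁎ amounts to a uniform lower bound on the smallest eigenvalue of the tangent kernel over a region around the initialization, which is why the condition number of the tangent kernel plays the central role.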
Author Liu, Chaoyue
Zhu, Libin
Belkin, Mikhail
Author_xml – sequence: 1
  givenname: Chaoyue
  orcidid: 0000-0003-0653-7872
  surname: Liu
  fullname: Liu, Chaoyue
  organization: Department of Computer Science and Engineering, The Ohio State University, United States of America
– sequence: 2
  givenname: Libin
  surname: Zhu
  fullname: Zhu, Libin
  organization: Department of Computer Science and Engineering, University of California, San Diego, United States of America
– sequence: 3
  givenname: Mikhail
  surname: Belkin
  fullname: Belkin, Mikhail
  email: mbelkin@ucsd.edu
  organization: Halicioğlu Data Science Institute, University of California, San Diego, United States of America
ContentType Journal Article
Copyright 2021 Elsevier Inc.
DOI 10.1016/j.acha.2021.12.009
Discipline Engineering
Mathematics
EndPage 116
ISSN 1063-5203
IsPeerReviewed true
IsScholarly true
Keywords Deep learning
PL⁎ condition
Over-parameterized models
Non-linear optimization
Language English
ORCID 0000-0003-0653-7872
PageCount 32
PublicationCentury 2000
PublicationDate July 2022
PublicationDateYYYYMMDD 2022-07-01
PublicationDecade 2020
PublicationTitle Applied and computational harmonic analysis
PublicationYear 2022
Publisher Elsevier Inc
References
[1] Allen-Zhu, Li, Song, "A convergence theory for deep learning via over-parameterization", International Conference on Machine Learning, pp. 242–252, 2019.
[2] Arora, Du, Hu, Li, Wang, "Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks", International Conference on Machine Learning, pp. 322–332, 2019.
[3] Bartlett, Helmbold, Long, "Gradient descent with identity initialization efficiently learns positive-definite linear transformations by deep residual networks", Neural Comput. 31 (3), pp. 477–502, 2019. doi: 10.1162/neco_a_01164.
[4] Bassily, Belkin, Ma, "On exponential convergence of SGD in non-convex over-parametrized learning", 2018.
[5] Belkin, Hsu, Ma, Mandal, "Reconciling modern machine-learning practice and the classical bias-variance trade-off", Proc. Natl. Acad. Sci. 116 (32), pp. 15849–15854, 2019. doi: 10.1073/pnas.1903070116.
[6] Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan, Child, Ramesh, Ziegler, Wu, Winter, Hesse, Chen, Sigler, Litwin, Gray, Chess, Clark, Berner, McCandlish, Radford, Sutskever, Amodei, "Language models are few-shot learners", Advances in Neural Information Processing Systems, pp. 1877–1901, 2020.
[7] Burgisser, Cucker, Condition: The Geometry of Numerical Algorithms, vol. 349, 2013.
[8] Charles, Papailiopoulos, "Stability and generalization of learning algorithms that converge to global optima", International Conference on Machine Learning, pp. 745–754, 2018.
[9] Chen, Dongarra, "Condition numbers of Gaussian random matrices", SIAM J. Matrix Anal. Appl. 27 (3), pp. 603–620, 2005. doi: 10.1137/040616413.
[10] Chizat, Oyallon, Bach, "On lazy training in differentiable programming", Advances in Neural Information Processing Systems, pp. 2933–2943, 2019.
[11] Cooper, "Global minima of overparameterized neural networks", SIAM J. Math. Data Sci. 3 (2), pp. 676–691, 2021. doi: 10.1137/19M1308943.
[12] Du, Zhai, Poczos, Singh, "Gradient descent provably optimizes over-parameterized neural networks", International Conference on Learning Representations, 2018.
[13] Du, Lee, Li, Wang, Zhai, "Gradient descent finds global minima of deep neural networks", International Conference on Machine Learning, pp. 1675–1685, 2019.
[14] Fedus, Zoph, Shazeer, "Switch transformers: scaling to trillion parameter models with simple and efficient sparsity", 2021.
[15] Gupta, Balakrishnan, Ramdas, "Path length bounds for gradient descent and flow", J. Mach. Learn. Res. 22 (68), pp. 1–63, 2021.
[16] He, Zhang, Ren, Sun, "Deep residual learning for image recognition", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[17] Jacot, Gabriel, Hongler, "Neural tangent kernel: convergence and generalization in neural networks", Advances in Neural Information Processing Systems, pp. 8571–8580, 2018.
[18] Ji, Telgarsky, "Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks", International Conference on Learning Representations, 2019.
[19] Kingma, Ba, "Adam: a method for stochastic optimization", ICLR, 2015.
[20] Lederer, "No spurious local minima: on the optimization landscapes of wide and deep neural networks", 2020.
[21] Lee, Xiao, Schoenholz, Bahri, Novak, Sohl-Dickstein, Pennington, "Wide neural networks of any depth evolve as linear models under gradient descent", Advances in Neural Information Processing Systems, pp. 8570–8581, 2019.
[22] Li, Ding, Sun, "Over-parameterized deep neural networks have no strict local minima for any continuous activations", 2018.
[23] Liu, Belkin, "Accelerating SGD with momentum for over-parameterized learning", The 8th International Conference on Learning Representations, 2020.
[24] Liu, Zhu, Belkin, "On the linearity of large non-linear models: when and why the tangent kernel is constant", Advances in Neural Information Processing Systems, vol. 33, 2020.
[25] Lojasiewicz, "A topological property of real analytic subsets", Coll. du CNRS, Les Equations aux dérivées partielles, vol. 117, pp. 87–89, 1963.
[26] Mei, Montanari, Nguyen, "A mean field view of the landscape of two-layer neural networks", Proc. Natl. Acad. Sci. 115 (33), pp. E7665–E7671, 2018. doi: 10.1073/pnas.1806579115.
[27] Nesterov, "A method for unconstrained convex minimization problem with the rate of convergence O(1/k²)", Dokl. Acad. Nauk USSR 269, pp. 543–547, 1983.
[28] Nguyen, Mukkamala, Hein, "On the loss landscape of a class of deep neural networks with no bad local valleys", International Conference on Learning Representations, 2018.
[29] Nocedal, Wright, Numerical Optimization, 2006.
[30] Oymak, Soltanolkotabi, "Toward moderate overparameterization: global convergence guarantees for training shallow neural networks", IEEE J. Sel. Areas Inf. Theory 1 (1), pp. 84–105, 2020. doi: 10.1109/JSAIT.2020.2991332.
[31] Poggio, Kur, Banburski, "Double descent in the condition number", 2019.
[32] Polyak, "Gradient methods for minimizing functionals", Ž. Vyčisl. Mat. Mat. Fiz. 3 (4), pp. 643–653, 1963.
[33] Soltanolkotabi, Javanmard, Lee, "Theoretical insights into the optimization landscape of over-parameterized shallow neural networks", IEEE Trans. Inf. Theory 65 (2), pp. 742–769, 2018. doi: 10.1109/TIT.2018.2854560.
[34] Spigler, Geiger, d'Ascoli, Sagun, Biroli, Wyart, "A jamming transition from under- to over-parametrization affects generalization in deep learning", J. Phys. A, Math. Theor. 52 (47), 474001, 2019. doi: 10.1088/1751-8121/ab4c8b.
[35] Vaswani, Bach, Schmidt, "Fast and faster convergence of SGD for overparameterized models and an accelerated perceptron", The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1195–1204, 2019.
[36] Wensing, Slotine, "Beyond convexity—contraction and global convergence of gradient descent", PLoS ONE 15 (8), e0236661, 2020. doi: 10.1371/journal.pone.0236661.
[37] Yu, Chen, "On the local minima free condition of backpropagation learning", IEEE Trans. Neural Netw. 6 (5), pp. 1300–1303, 1995. doi: 10.1109/72.410380.
[38] Zou, Cao, Zhou, Gu, "Gradient descent optimizes overparameterized deep ReLU networks", Mach. Learn. 109 (3), pp. 467–492, 2020. doi: 10.1007/s10994-019-05839-6.
StartPage 85
Title Loss landscapes and optimization in over-parameterized non-linear systems and neural networks
URI https://dx.doi.org/10.1016/j.acha.2021.12.009
Volume 59