Loss landscapes and optimization in over-parameterized non-linear systems and neural networks
Saved in:
| Published in: | Applied and Computational Harmonic Analysis, Vol. 59, pp. 85 - 116 |
|---|---|
| Main Authors: | Liu, Chaoyue; Zhu, Libin; Belkin, Mikhail |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier Inc, 01.07.2022 |
| Keywords: | Deep learning; PL⁎ condition; Over-parameterized models; Non-linear optimization |
| ISSN: | 1063-5203 |
| Online Access: | Full text |
| Abstract | The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. The purpose of this work is to propose a modern view and a general mathematical framework for loss landscapes and efficient optimization in over-parameterized machine learning models and systems of non-linear equations, a setting that includes over-parameterized deep neural networks. Our starting observation is that optimization landscapes corresponding to such systems are generally not convex, even locally around a global minimum, a condition we call essential non-convexity. We argue that instead they satisfy PL⁎, a variant of the Polyak-Łojasiewicz condition [32,25], on most (but not all) of the parameter space, which guarantees both the existence of solutions and efficient optimization by (stochastic) gradient descent (SGD/GD). The PL⁎ condition of these systems is closely related to the condition number of the tangent kernel associated with the non-linear system, showing how a PL⁎-based non-linear theory parallels classical analyses of over-parameterized linear equations. We show that wide neural networks satisfy the PL⁎ condition, which explains the (S)GD convergence to a global minimum. Finally, we propose a relaxation of the PL⁎ condition applicable to “almost” over-parameterized systems. |
|---|---|
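The PL⁎ condition described in the abstract can be illustrated numerically. Below is a minimal sketch, not taken from the paper: the toy map F, its dimensions, and all variable names are hypothetical choices. It checks the key inequality for a square loss L(w) = 1/2 ||F(w) - y||^2, namely that the local PL⁎ constant is bounded below by the smallest eigenvalue of the tangent kernel K(w) = DF(w) DF(w)^T, so 1/2 ||grad L(w)||^2 >= lambda_min(K(w)) * L(w).

```python
# Minimal numerical sketch (a toy non-linear map stands in for a wide network)
# of the PL* lower bound via the tangent kernel. For the square loss
#   L(w) = 1/2 * ||F(w) - y||^2,
# one has 1/2 * ||grad L(w)||^2 = 1/2 * r^T K(w) r >= lambda_min(K(w)) * L(w),
# where r = F(w) - y and K(w) = DF(w) DF(w)^T is the tangent kernel.
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 50                       # n equations (data points), p >> n parameters
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def F(w):
    """Toy over-parameterized non-linear system F: R^p -> R^n."""
    return np.tanh(A @ w)

def jacobian(F, w, eps=1e-6):
    """Forward-difference Jacobian DF(w) of shape (n, p)."""
    f0 = F(w)
    J = np.empty((f0.size, w.size))
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = eps
        J[:, j] = (F(w + e) - f0) / eps
    return J

w = 0.1 * rng.standard_normal(p)
r = F(w) - y                       # residual
loss = 0.5 * r @ r
J = jacobian(F, w)
grad = J.T @ r                     # gradient of the square loss
K = J @ J.T                        # tangent kernel, n x n
mu = np.linalg.eigvalsh(K)[0]      # smallest eigenvalue = local PL* constant

# The PL* inequality holds up to finite-difference error in J:
print(f"1/2 ||grad||^2 = {0.5 * grad @ grad:.6f} >= mu * L = {mu * loss:.6f}")
```

Per the abstract, the point of the paper is that for sufficiently wide networks this smallest eigenvalue stays bounded away from zero over most of the parameter space, which turns the local inequality above into a global (S)GD convergence guarantee.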
| Author | Liu, Chaoyue; Zhu, Libin; Belkin, Mikhail |
| Author_xml | 1. Liu, Chaoyue (ORCID 0000-0003-0653-7872), Department of Computer Science and Engineering, The Ohio State University, United States of America; 2. Zhu, Libin, Department of Computer Science and Engineering, University of California, San Diego, United States of America; 3. Belkin, Mikhail (mbelkin@ucsd.edu), Halicioğlu Data Science Institute, University of California, San Diego, United States of America |
| ContentType | Journal Article |
| Copyright | 2021 Elsevier Inc. |
| DOI | 10.1016/j.acha.2021.12.009 |
| Discipline | Engineering; Mathematics |
| EndPage | 116 |
| ISSN | 1063-5203 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Keywords | Deep learning; PL⁎ condition; Over-parameterized models; Non-linear optimization |
| Language | English |
| ORCID | 0000-0003-0653-7872 |
| PageCount | 32 |
| PublicationCentury | 2000 |
| PublicationDate | July 2022 |
| PublicationDateYYYYMMDD | 2022-07-01 |
| PublicationDecade | 2020 |
| PublicationTitle | Applied and computational harmonic analysis |
| PublicationYear | 2022 |
| Publisher | Elsevier Inc |
| References | [1] Allen-Zhu, Li, Song (2019). A convergence theory for deep learning via over-parameterization. In: International Conference on Machine Learning, pp. 242-252.
[2] Arora, Du, Hu, Li, Wang (2019). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In: International Conference on Machine Learning, pp. 322-332.
[3] Bartlett, Helmbold, Long (2019). Gradient descent with identity initialization efficiently learns positive-definite linear transformations by deep residual networks. Neural Comput. 31(3):477-502. doi: 10.1162/neco_a_01164.
[4] Bassily, Belkin, Ma (2018). On exponential convergence of SGD in non-convex over-parametrized learning.
[5] Belkin, Hsu, Ma, Mandal (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc. Natl. Acad. Sci. 116(32):15849-15854. doi: 10.1073/pnas.1903070116.
[6] Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan, Child, Ramesh, Ziegler, Wu, Winter, Hesse, Chen, Sigler, Litwin, Gray, Chess, Clark, Berner, McCandlish, Radford, Sutskever, Amodei (2020). Language models are few-shot learners. In: Advances in Neural Information Processing Systems, pp. 1877-1901.
[7] Burgisser, Cucker (2013). Condition: The Geometry of Numerical Algorithms, vol. 349.
[8] Charles, Papailiopoulos (2018). Stability and generalization of learning algorithms that converge to global optima. In: International Conference on Machine Learning, pp. 745-754.
[9] Chen, Dongarra (2005). Condition numbers of Gaussian random matrices. SIAM J. Matrix Anal. Appl. 27(3):603-620. doi: 10.1137/040616413.
[10] Chizat, Oyallon, Bach (2019). On lazy training in differentiable programming. In: Advances in Neural Information Processing Systems, pp. 2933-2943.
[11] Cooper (2021). Global minima of overparameterized neural networks. SIAM J. Math. Data Sci. 3(2):676-691. doi: 10.1137/19M1308943.
[12] Du, Zhai, Poczos, Singh (2018). Gradient descent provably optimizes over-parameterized neural networks. In: International Conference on Learning Representations.
[13] Du, Lee, Li, Wang, Zhai (2019). Gradient descent finds global minima of deep neural networks. In: International Conference on Machine Learning, pp. 1675-1685.
[14] Fedus, Zoph, Shazeer (2021). Switch transformers: scaling to trillion parameter models with simple and efficient sparsity.
[15] Gupta, Balakrishnan, Ramdas (2021). Path length bounds for gradient descent and flow. J. Mach. Learn. Res. 22(68):1-63.
[16] He, Zhang, Ren, Sun (2016). Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778.
[17] Jacot, Gabriel, Hongler (2018). Neural tangent kernel: convergence and generalization in neural networks. In: Advances in Neural Information Processing Systems, pp. 8571-8580.
[18] Ji, Telgarsky (2019). Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks. In: International Conference on Learning Representations.
[19] Kingma, Ba (2015). Adam: a method for stochastic optimization. In: ICLR.
[20] Lederer (2020). No spurious local minima: on the optimization landscapes of wide and deep neural networks.
[21] Lee, Xiao, Schoenholz, Bahri, Novak, Sohl-Dickstein, Pennington (2019). Wide neural networks of any depth evolve as linear models under gradient descent. In: Advances in Neural Information Processing Systems, pp. 8570-8581.
[22] Li, Ding, Sun (2018). Over-parameterized deep neural networks have no strict local minima for any continuous activations.
[23] Liu, Belkin (2020). Accelerating SGD with momentum for over-parameterized learning. In: The 8th International Conference on Learning Representations.
[24] Liu, Zhu, Belkin (2020). On the linearity of large non-linear models: when and why the tangent kernel is constant. In: Advances in Neural Information Processing Systems, vol. 33.
[25] Lojasiewicz (1963). A topological property of real analytic subsets. In: Coll. du CNRS, Les Equations aux dérivées partielles, vol. 117, pp. 87-89.
[26] Mei, Montanari, Nguyen (2018). A mean field view of the landscape of two-layer neural networks. Proc. Natl. Acad. Sci. 115(33):E7665-E7671. doi: 10.1073/pnas.1806579115.
[27] Nesterov (1983). A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Dokl. Acad. Nauk USSR 269:543-547.
[28] Nguyen, Mukkamala, Hein (2018). On the loss landscape of a class of deep neural networks with no bad local valleys. In: International Conference on Learning Representations.
[29] Nocedal, Wright (2006). Numerical Optimization.
[30] Oymak, Soltanolkotabi (2020). Toward moderate overparameterization: global convergence guarantees for training shallow neural networks. IEEE J. Sel. Areas Inf. Theory 1(1):84-105. doi: 10.1109/JSAIT.2020.2991332.
[31] Poggio, Kur, Banburski (2019). Double descent in the condition number.
[32] Polyak (1963). Gradient methods for minimizing functionals. Ž. Vyčisl. Mat. Mat. Fiz. 3(4):643-653.
[33] Soltanolkotabi, Javanmard, Lee (2018). Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Trans. Inf. Theory 65(2):742-769. doi: 10.1109/TIT.2018.2854560.
[34] Spigler, Geiger, d'Ascoli, Sagun, Biroli, Wyart (2019). A jamming transition from under- to over-parametrization affects generalization in deep learning. J. Phys. A, Math. Theor. 52(47):474001. doi: 10.1088/1751-8121/ab4c8b.
[35] Vaswani, Bach, Schmidt (2019). Fast and faster convergence of SGD for overparameterized models and an accelerated perceptron. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1195-1204.
[36] Wensing, Slotine (2020). Beyond convexity—contraction and global convergence of gradient descent. PLoS ONE 15(8):e0236661. doi: 10.1371/journal.pone.0236661.
[37] Yu, Chen (1995). On the local minima free condition of backpropagation learning. IEEE Trans. Neural Netw. 6(5):1300-1303. doi: 10.1109/72.410380.
[38] Zou, Cao, Zhou, Gu (2020). Gradient descent optimizes overparameterized deep ReLU networks. Mach. Learn. 109(3):467-492. doi: 10.1007/s10994-019-05839-6. |
| StartPage | 85 |
| SubjectTerms | Deep learning; Non-linear optimization; Over-parameterized models; PL⁎ condition |
| Title | Loss landscapes and optimization in over-parameterized non-linear systems and neural networks |
| URI | https://dx.doi.org/10.1016/j.acha.2021.12.009 |
| Volume | 59 |