A Study of Generalization of Stochastic Mirror Descent Algorithms on Overparameterized Nonlinear Models

Published in: Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998), pp. 3132-3136
Main Authors: Azizan, Navid; Lale, Sahin; Hassibi, Babak
Format: Conference paper
Language: English
Published: IEEE, 01.05.2020
ISSN: 2379-190X
Description
Summary: We study the convergence, the implicit regularization, and the generalization of stochastic mirror descent (SMD) algorithms in overparameterized nonlinear models, where the number of model parameters exceeds the number of training data points. Due to overparameterization, the training loss has infinitely many global minima, which define a manifold of interpolating solutions. To understand the generalization performance of SMD algorithms, it is important to characterize which global minima they converge to. In this work, we first show theoretically that in the overparameterized nonlinear setting, if the initialization is close enough to the manifold of global minima (which is usually the case under high overparameterization), then with a sufficiently small step size SMD converges to a global minimum. We further prove that this global minimum is approximately the closest one to the initialization in Bregman divergence, demonstrating the approximate implicit regularization of SMD. We then empirically confirm that these theoretical results hold in practice. Finally, we provide an extensive study of the generalization of SMD algorithms. In our experiments, we show that on the CIFAR-10 dataset, SMD with an ℓ10-norm potential (as a surrogate for ℓ∞) consistently generalizes better than SGD (which corresponds to an ℓ2-norm potential), which in turn consistently outperforms SMD with an ℓ1-norm potential.
DOI: 10.1109/ICASSP40776.2020.9053864
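
As an illustration of the algorithm discussed in the summary above: stochastic mirror descent with a potential function ψ updates the mirror image ∇ψ(w) of the weights rather than the weights themselves, and the choice ψ(w) = (1/p)‖w‖_p^p gives the ℓp-norm potentials compared in the paper (p = 2 recovers SGD). The sketch below is a minimal NumPy rendering of that generic update, under assumed names (smd_lp, grad_fn) and a made-up toy problem; it is not the authors' implementation.

    import numpy as np

    def smd_lp(w0, grad_fn, data, lr=1e-3, p=10.0, epochs=1):
        # SMD with potential psi(w) = (1/p) * ||w||_p^p. Each step updates
        # the dual variable z = grad_psi(w), where grad_psi(w)_j =
        # sign(w_j) * |w_j|^(p-1), via z <- z - lr * grad_loss_i(w); the
        # inverse map back to the weights is w_j = sign(z_j) * |z_j|^(1/(p-1)).
        # With p = 2, z = w and the update is plain SGD.
        q = p - 1.0
        z = np.sign(w0) * np.abs(w0) ** q                 # mirror image of w0
        for _ in range(epochs):
            for x, y in data:                             # one sample per step
                w = np.sign(z) * np.abs(z) ** (1.0 / q)   # back to primal space
                z -= lr * grad_fn(w, x, y)                # step in dual space
        return np.sign(z) * np.abs(z) ** (1.0 / q)

    # Toy usage: an underdetermined linear model (5 data points, 20 parameters)
    # with a per-sample squared loss, so the interpolating set is a manifold.
    rng = np.random.default_rng(0)
    X, Y = rng.normal(size=(5, 20)), rng.normal(size=5)
    grad = lambda w, x, y: 2.0 * (x @ w - y) * x          # d/dw of (x.w - y)^2
    w = smd_lp(np.full(20, 0.1), grad, list(zip(X, Y)), lr=1e-2, p=10.0, epochs=200)

The implicit-regularization result stated in the summary says that, under its conditions, the solution such a run converges to is approximately the minimizer, over interpolating w, of the Bregman divergence D_ψ(w, w₀) = ψ(w) − ψ(w₀) − ⟨∇ψ(w₀), w − w₀⟩ of the potential; different choices of p therefore select different interpolating solutions, which is what the paper's CIFAR-10 comparison probes.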