A Hybrid Stochastic-Deterministic Minibatch Proximal Gradient Method for Efficient Optimization and Generalization

Detailed Bibliography
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 44, Issue 10, pp. 5933-5946
Main Authors: Zhou, Pan; Yuan, Xiao-Tong; Lin, Zhouchen; Hoi, Steven C.H.
Format: Journal Article
Language: English
Publication Details: United States: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 1 October 2022
ISSN: 0162-8828, 1939-3539, 2160-9292
Description
Summary: Despite the success of stochastic variance-reduced gradient (SVRG) algorithms in solving large-scale problems, their stochastic gradient complexity often scales linearly with the data size and becomes expensive for huge datasets. Accordingly, we propose a hybrid stochastic-deterministic minibatch proximal gradient (HSDMPG) algorithm for strongly convex problems with a linear prediction structure, e.g., least squares and logistic/softmax regression. HSDMPG enjoys an improved computational complexity that is independent of the data size for large-scale problems. It iteratively samples an evolving minibatch of individual losses to estimate the original problem, and can efficiently minimize the sampled subproblems. For a strongly convex loss of $n$ components, HSDMPG attains an $\epsilon$-optimization error within $\mathcal{O}\big(\kappa \log^{\zeta+1}(\tfrac{1}{\epsilon})\tfrac{1}{\epsilon} \,\wedge\, n\log^{\zeta}(\tfrac{1}{\epsilon})\big)$ stochastic gradient evaluations, where $\kappa$ is the condition number, $\zeta = 1$ for quadratic loss, and $\zeta = 2$ for generic loss. For large-scale problems, this complexity outperforms those of SVRG-type algorithms with or without dependence on the data size.
In particular, when $\epsilon = \mathcal{O}(1/\sqrt{n})$, which matches the intrinsic excess error of a learning model and is sufficient for generalization, the complexity for quadratic and generic losses is $\mathcal{O}(n^{0.5}\log^{2}(n))$ and $\mathcal{O}(n^{0.5}\log^{3}(n))$ respectively, which for the first time achieves optimal generalization in less than a single pass over the data. In addition, we extend HSDMPG to online strongly convex problems and prove its higher efficiency over prior algorithms. Numerical results demonstrate the computational advantages of HSDMPG.
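Note on the quoted rates: the $\epsilon = \mathcal{O}(1/\sqrt{n})$ figures follow from the first term of the complexity bound. With $1/\epsilon = \mathcal{O}(\sqrt{n})$ and $\log(1/\epsilon) = \mathcal{O}(\log n)$, and treating $\kappa$ as a constant, $\kappa \log^{\zeta+1}(\tfrac{1}{\epsilon})\tfrac{1}{\epsilon} = \mathcal{O}(n^{0.5}\log^{\zeta+1}(n))$, i.e., $\mathcal{O}(n^{0.5}\log^{2}(n))$ for $\zeta = 1$ and $\mathcal{O}(n^{0.5}\log^{3}(n))$ for $\zeta = 2$; the second term $n\log^{\zeta}(\tfrac{1}{\epsilon})$ grows faster in $n$, so the minimum ($\wedge$) is attained by the first term.
The sketch below illustrates the evolving-minibatch structure described in the abstract on an $\ell_2$-regularized least-squares problem. It is a minimal illustration only, not the authors' exact HSDMPG procedure: the minibatch growth schedule, inner solver, step size, and names (prox_l2, hsdmpg_sketch) are assumptions made for this example.

import numpy as np

def prox_l2(w, lam, eta):
    # Proximal operator of the l2 regularizer (lam/2)*||w||^2 with step size eta:
    # argmin_z (lam/2)*||z||^2 + (1/(2*eta))*||z - w||^2 has this closed form.
    return w / (1.0 + eta * lam)

def hsdmpg_sketch(X, y, lam=1e-3, eta=0.1, n_stages=10, b0=8, growth=2.0,
                  inner_steps=20, seed=0):
    # Illustrative evolving-minibatch proximal gradient loop (hypothetical
    # parameters): each stage samples a growing minibatch of individual losses
    # and approximately minimizes the sampled subproblem.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    b = b0
    for _ in range(n_stages):
        # Sample an evolving minibatch of individual losses.
        idx = rng.choice(n, size=min(int(b), n), replace=False)
        Xb, yb = X[idx], y[idx]
        for _ in range(inner_steps):
            grad = Xb.T @ (Xb @ w - yb) / len(idx)  # gradient of the sampled quadratic loss
            w = prox_l2(w - eta * grad, lam, eta)   # proximal step on the regularizer
        b *= growth  # enlarge the minibatch so the subproblem tracks the full objective
    return w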
DOI: 10.1109/TPAMI.2021.3087328