Sample size selection in optimization methods for machine learning

Bibliographic Details
Published in: Mathematical Programming, Vol. 134, No. 1, pp. 127-155
Main Authors: Byrd, Richard H.; Chin, Gillian M.; Nocedal, Jorge; Wu, Yuchen
Format: Journal Article; Conference Proceeding
Language:English
Published: Berlin/Heidelberg: Springer-Verlag, 01.08.2012
ISSN: 0025-5610, 1436-4646
Description
Summary: This paper presents a methodology for using varying sample sizes in batch-type optimization methods for large-scale machine learning problems. The first part of the paper deals with the delicate issue of dynamic sample selection in the evaluation of the function and gradient. We propose a criterion for increasing the sample size based on variance estimates obtained during the computation of a batch gradient, and establish a complexity bound on the total cost of a gradient method. The second part of the paper describes a practical Newton method that uses a smaller sample to compute Hessian-vector products than to evaluate the function and the gradient, and that also employs a dynamic sampling technique. The focus shifts in the third part of the paper to L1-regularized problems designed to produce sparse solutions. We propose a Newton-like method that consists of two phases: a (minimalistic) gradient projection phase that identifies zero variables, and a subspace phase that applies a subsampled Hessian Newton iteration in the free variables. Numerical tests on speech recognition problems illustrate the performance of the algorithms.
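The sample-size criterion described in the first part can be illustrated with a short sketch. The following is a minimal, hypothetical Python rendering of a variance-based test of the kind the abstract describes: the estimated variance of the batch gradient is compared against its squared norm, and the sample grows when the test fails. The threshold theta, the particular variance estimate, and the growth rule below are assumptions for illustration, not the paper's exact formulas.

```python
import numpy as np

def should_grow_sample(per_example_grads, theta=0.9):
    """Hedged sketch of a variance-based sample-size test.

    per_example_grads: array of shape (S, d) holding the gradient of each
    sampled example's loss at the current iterate.
    Returns (grow, suggested_size).
    """
    S, _ = per_example_grads.shape
    g = per_example_grads.mean(axis=0)          # batch (sample) gradient
    # Estimated variance of the batch gradient: componentwise sample
    # variances, summed, then divided by the sample size.
    var_term = per_example_grads.var(axis=0, ddof=1).sum() / S
    g_norm_sq = float(g @ g)
    # Keep the current sample if its gradient is reliable relative to its
    # own magnitude; otherwise grow the sample.
    if var_term <= theta**2 * g_norm_sq:
        return False, S
    # Heuristic new size that would pass the test if the per-example
    # variance stayed roughly constant.
    new_S = int(np.ceil(var_term * S / (theta**2 * g_norm_sq)))
    return True, max(new_S, S + 1)
```

In a training loop, a test of this kind would run after each batch-gradient evaluation, with the method resampling at the suggested size whenever it returns True.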
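The second part's idea, using a smaller subsample for Hessian-vector products than for the function and gradient, can likewise be sketched. Below is a hedged Python illustration, not the paper's algorithm: Hessian-vector products are approximated by finite differences of subsampled gradients and fed to a conjugate-gradient solve for the Newton step. The grad_fn interface, the finite-difference step eps, and the CG parameters are all hypothetical.

```python
import numpy as np

def hvp(grad_fn, w, v, hess_idx, eps=1e-6):
    """Finite-difference Hessian-vector product over a small subsample.
    grad_fn(w, idx) must return the loss gradient averaged over the
    examples in idx (a hypothetical interface)."""
    return (grad_fn(w + eps * v, hess_idx) - grad_fn(w, hess_idx)) / eps

def newton_cg_step(grad_fn, w, grad_idx, hess_idx, cg_iters=10, tol=1e-4):
    """Approximate Newton step p solving H p = -g by conjugate gradients.
    g is computed on the (larger) gradient sample grad_idx, while each
    Hessian-vector product uses the (smaller) subsample hess_idx."""
    g = grad_fn(w, grad_idx)
    p = np.zeros_like(w)
    r = -g                      # residual of H p = -g at p = 0
    d = r.copy()
    rs = float(r @ r)
    for _ in range(cg_iters):
        Hd = hvp(grad_fn, w, d, hess_idx)
        alpha = rs / float(d @ Hd)
        p = p + alpha * d
        r = r - alpha * Hd
        rs_new = float(r @ r)
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p
```

Because each CG iteration costs one Hessian-vector product, shrinking the Hessian subsample reduces per-iteration cost roughly in proportion, which is the trade-off the abstract alludes to.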
DOI: 10.1007/s10107-012-0572-5