Abstract base class for stochastic (mini-batch) first-order optimizers.
At each iteration the gradient is estimated on a mini batch of the data
drawn from f.args() rather than on the whole sample, and the point is moved
by a step controlled by a (possibly adaptive or scheduled) learning rate.
The mini batches are produced by iterating over the data, optionally shuffled
once per epoch, where an epoch is a full pass over all the samples.
Parameters:
f – the objective function.
x – ([n x 1] real column vector): the starting point of the algorithm.
step_size – (real scalar > 0, callable or iterable, optional, default value 0.01): the
learning rate, i.e., the size of the step taken along the search direction.
It can be a positive scalar (kept constant for all iterations), a callable
with signature step_size(X_batch, y_batch) returning an iterator (e.g., one
of the schedules in schedules.py) or an iterable yielding the step size to
use at each iteration.
batch_size – (integer scalar or None, optional, default value None): the size of the mini
batches used to estimate the gradient. If None, the full sample is used at
every iteration (i.e., plain batch gradient descent); otherwise the value is
clipped to lie in [1, n_samples].
eps – (real scalar, optional, default value 1e-6): the accuracy in the stopping
criterion: the algorithm is stopped when the norm of the gradient is less
than or equal to eps.
tol – (real scalar, optional, default value 1e-8): the tolerance used in the
optimality conditions of the Lagrangian dual (when f is a Lagrangian dual
function), i.e., the algorithm is stopped when the variables or the
constraints change by less than tol.
epochs – (integer scalar, optional, default value 1000): the maximum number of epochs,
i.e., full passes over the whole sample, before the algorithm is stopped.
callback – (callable, optional, default value None): a function called at each iteration
with the optimizer instance (and callback_args) as arguments; it can raise
StopIteration to interrupt the optimization.
callback_args – (tuple, optional, default value ()): additional positional arguments passed
to the callback at each call.
shuffle – (boolean, optional, default value True): whether to shuffle the order of the
mini batches at the beginning of each epoch.
random_state – (integer scalar or None, optional, default value None): seed for the random
number generator used to initialize x (when not provided) and to shuffle the
mini batches, for reproducibility.
verbose – (boolean or integer, optional, default value False): print details about each
iteration if True (or every verbose epochs if an integer), nothing otherwise.
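As an illustration of the three accepted forms of step_size, the following minimal
sketch normalizes them into a single iterator. The helper name as_step_iterator is
hypothetical and not part of the library, and the assumption that a callable receives
the current mini batch is taken only from the signature quoted above; the way the
optimizer actually consumes step_size may differ.

```python
import itertools


def as_step_iterator(step_size, *batch):
    """Hypothetical helper: turn any accepted step_size form into an iterator."""
    if callable(step_size):
        # assumed convention: the callable receives the current mini batch
        # and returns an iterator of step sizes
        return iter(step_size(*batch))
    try:
        # an iterable yields the step size to use at each iteration
        return iter(step_size)
    except TypeError:
        # a plain scalar is kept constant for all iterations
        if step_size <= 0:
            raise ValueError("step_size must be > 0")
        return itertools.repeat(step_size)
```

With this sketch, next(as_step_iterator(0.01)) keeps returning 0.01, while an iterable
such as [0.5, 0.25] is consumed one value per iteration.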
Return an iterator that successively yields tuples of aligned mini batches
of size batch_size from the sliceable arrays returned by f.args(),
in random order (when shuffle is True) without replacement.
Returns:
an infinite iterator of mini batches (one tuple of aligned slices per step).
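The behaviour described above can be sketched as follows. This is an illustration
under stated assumptions, not the library's implementation: the function name
minibatches is hypothetical, plain Python lists stand in for the arrays returned by
f.args(), the sample indices are reshuffled at the start of each epoch, and the last
mini batch of an epoch is allowed to be smaller than batch_size.

```python
import random


def minibatches(arrays, batch_size=None, shuffle=True, seed=None):
    """Yield tuples of aligned mini batches forever (hypothetical sketch)."""
    n_samples = len(arrays[0])
    # None means full batch; otherwise clip the size to [1, n_samples]
    size = n_samples if batch_size is None else max(1, min(batch_size, n_samples))
    rng = random.Random(seed)
    indices = list(range(n_samples))
    while True:  # each pass of this loop is one epoch
        if shuffle:
            rng.shuffle(indices)  # new sample order, no replacement within an epoch
        for start in range(0, n_samples, size):
            batch = indices[start:start + size]
            # the same indices are applied to every array, so slices stay aligned
            yield tuple([a[i] for i in batch] for a in arrays)
```

Because the iterator never ends, the caller is responsible for stopping, for example
after epochs full passes or once the stopping criterion on eps is met.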
Abstract base class for stochastic optimizers that support a momentum term.
In addition to the plain stochastic step it keeps a velocity that accumulates
an exponentially decaying fraction of the past steps, which damps oscillations
and accelerates convergence along consistent descent directions. Both the
classical heavy-ball (Polyak) momentum and Nesterov’s accelerated momentum are
supported, as selected by momentum_type.
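The difference between the two momentum types can be sketched as below. The function
momentum_step is a hypothetical helper written for this illustration, using the common
textbook formulation (velocity v = momentum * v - step_size * gradient); the sign and
scaling conventions used by the library may differ.

```python
def momentum_step(x, v, grad, step_size=0.01, momentum=0.9, momentum_type="polyak"):
    """One update of x and velocity v; grad is a callable returning the gradient."""
    if momentum_type == "none":
        g = grad(x)
        v = [-step_size * gi for gi in g]
    elif momentum_type == "polyak":
        # heavy ball: gradient evaluated at the current point
        g = grad(x)
        v = [momentum * vi - step_size * gi for vi, gi in zip(v, g)]
    elif momentum_type == "nesterov":
        # Nesterov: gradient evaluated after the momentum jump
        look_ahead = [xi + momentum * vi for xi, vi in zip(x, v)]
        g = grad(look_ahead)
        v = [momentum * vi - step_size * gi for vi, gi in zip(v, g)]
    else:
        raise ValueError("unknown momentum_type: %r" % (momentum_type,))
    x = [xi + vi for xi, vi in zip(x, v)]
    return x, v
```

Starting from a zero velocity the first step is identical for both types; they only
diverge from the second step on, once the velocity is non-zero.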
Parameters:
f – the objective function.
x – ([n x 1] real column vector): the starting point of the algorithm.
step_size – (real scalar > 0, callable or iterable, optional, default value 0.01): the
learning rate, i.e., the size of the step taken along the search direction
(see StochasticOptimizer for the accepted forms).
momentum_type – (string in {‘none’, ‘polyak’, ‘nesterov’}, optional, default value ‘none’):
the kind of momentum to apply: ‘none’ disables momentum, ‘polyak’ uses the
classical heavy-ball momentum and ‘nesterov’ uses Nesterov’s accelerated
momentum (the gradient is evaluated after the momentum jump).
momentum – (real scalar in [0, 1) or iterable, optional, default value 0.9): the momentum
factor, i.e., the fraction of the previous step retained in the current one.
It can be a scalar (kept constant) or an iterable yielding the value to use at
each iteration.
batch_size – (integer scalar or None, optional, default value None): the size of the mini
batches used to estimate the gradient; if None the full sample is used.
eps – (real scalar, optional, default value 1e-6): the accuracy in the stopping
criterion: the algorithm is stopped when the norm of the gradient is less
than or equal to eps.
tol – (real scalar, optional, default value 1e-8): the tolerance used in the
optimality conditions of the Lagrangian dual (when f is a Lagrangian dual).
epochs – (integer scalar, optional, default value 1000): the maximum number of epochs
before the algorithm is stopped.
callback – (callable, optional, default value None): a function called at each iteration
with the optimizer instance (and callback_args) as arguments; it can raise
StopIteration to interrupt the optimization.
callback_args – (tuple, optional, default value ()): additional positional arguments passed
to the callback at each call.
shuffle – (boolean, optional, default value True): whether to shuffle the order of the
mini batches at the beginning of each epoch.
random_state – (integer scalar or None, optional, default value None): seed for the random
number generator, for reproducibility.
verbose – (boolean or integer, optional, default value False): print details about each
iteration if True (or every verbose epochs if an integer), nothing otherwise.
Return an iterator that successively yields tuples of aligned mini batches
of size batch_size from the sliceable arrays returned by f.args(),
in random order (when shuffle is True) without replacement.
Returns:
an infinite iterator of mini batches (one tuple of aligned slices per step).
Stochastic Gradient Descent (SGD) for the minimization of the provided
function f.
At each iteration the point is moved by a fixed (or scheduled) learning rate
along the negative of the gradient estimated on a mini batch of the data,
optionally accelerated by a classical heavy-ball (Polyak) or Nesterov
momentum term as selected by momentum_type.
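A minimal sketch of the plain (momentum-free) mini-batch update, assuming a
one-dimensional least-squares model y ≈ w * x + b. The function sgd_linear_fit and
its data layout are hypothetical and only illustrate the idea; they do not reproduce
the class interface documented here.

```python
import random


def sgd_linear_fit(xs, ys, step_size=0.1, batch_size=4, epochs=500, seed=0):
    """Fit y ~ w * x + b by mini-batch SGD on the halved mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new sample order at each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            residuals = [w * xs[i] + b - ys[i] for i in batch]
            # gradient estimated on the mini batch only
            grad_w = sum(r * xs[i] for r, i in zip(residuals, batch)) / len(batch)
            grad_b = sum(residuals) / len(batch)
            # move along the negative gradient by the learning rate
            w -= step_size * grad_w
            b -= step_size * grad_b
    return w, b
```

On noise-free data every mini-batch gradient vanishes at the true parameters, so the
iterates settle on them despite the sampling noise.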
Parameters:
f – the objective function.
x – ([n x 1] real column vector): the starting point of the algorithm.
batch_size – (integer scalar or None, optional, default value None): the size of the mini
batches used to estimate the gradient; if None the full sample is used.
eps – (real scalar, optional, default value 1e-6): the accuracy in the stopping
criterion: the algorithm is stopped when the norm of the gradient is less
than or equal to eps.
tol – (real scalar, optional, default value 1e-8): the tolerance used in the
optimality conditions of the Lagrangian dual (when f is a Lagrangian dual).
epochs – (integer scalar, optional, default value 1000): the maximum number of epochs
before the algorithm is stopped.
step_size – (real scalar > 0, callable or iterable, optional, default value 0.01): the
learning rate, i.e., the size of the step taken along the negative gradient.
momentum_type – (string in {‘none’, ‘polyak’, ‘nesterov’}, optional, default value ‘none’):
the kind of momentum to apply (‘none’, heavy-ball ‘polyak’ or ‘nesterov’).
momentum – (real scalar in [0, 1) or iterable, optional, default value 0.9): the momentum
factor, i.e., the fraction of the previous step retained in the current one.
callback – (callable, optional, default value None): a function called at each iteration
with the optimizer instance (and callback_args) as arguments; it can raise
StopIteration to interrupt the optimization.
callback_args – (tuple, optional, default value ()): additional positional arguments passed
to the callback at each call.
shuffle – (boolean, optional, default value True): whether to shuffle the order of the
mini batches at the beginning of each epoch.
random_state – (integer scalar or None, optional, default value None): seed for the random
number generator, for reproducibility.
verbose – (boolean or integer, optional, default value False): print details about each
iteration if True (or every verbose epochs if an integer), nothing otherwise.
Return an iterator that successively yields tuples of aligned mini batches
of size batch_size from the sliceable arrays returned by f.args(),
in random order (when shuffle is True) without replacement.
Returns:
an infinite iterator of mini batches (one tuple of aligned slices per step).
Adam (Adaptive Moment Estimation) for the minimization of the provided
function f.
It keeps exponentially decaying running averages of the gradient (first
moment) and of the squared gradient (second raw moment), both bias-corrected,
and scales the step element-wise by the inverse of the square root of the
second moment, thus adapting the learning rate to each coordinate.
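The update described above can be sketched as a single step function. The name
adam_step and its argument layout are assumptions made for this illustration; the
formulas follow the standard Adam formulation, with offset playing the role of the
small constant in the denominator and t the 1-based iteration counter.

```python
import math


def adam_step(x, g, m, v, t, step_size=0.001, beta1=0.9, beta2=0.999, offset=1e-8):
    """One Adam update given the gradient g at x (t starts from 1)."""
    # exponentially decaying averages of the gradient and of its square
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # bias correction: both averages start at zero and are biased towards it
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # element-wise scaling by the inverse square root of the second moment
    x = [xi - step_size * mh / (math.sqrt(vh) + offset)
         for xi, mh, vh in zip(x, m_hat, v_hat)]
    return x, m, v
```

At the first iteration the bias-corrected ratio reduces to roughly the sign of the
gradient, so each coordinate moves by about step_size regardless of the gradient scale.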
Parameters:
f – the objective function.
x – ([n x 1] real column vector): the starting point of the algorithm.
batch_size – (integer scalar or None, optional, default value None): the size of the mini
batches used to estimate the gradient; if None the full sample is used.
eps – (real scalar, optional, default value 1e-6): the accuracy in the stopping
criterion: the algorithm is stopped when the norm of the gradient is less
than or equal to eps.
tol – (real scalar, optional, default value 1e-8): the tolerance used in the
optimality conditions of the Lagrangian dual (when f is a Lagrangian dual).
epochs – (integer scalar, optional, default value 1000): the maximum number of epochs
before the algorithm is stopped.
step_size – (real scalar > 0, callable or iterable, optional, default value 0.001): the
learning rate, i.e., the base size of the step taken along the search direction.
momentum_type – (string in {‘none’, ‘polyak’, ‘nesterov’}, optional, default value ‘none’):
the kind of (outer) momentum applied on top of the Adam step.
momentum – (real scalar in [0, 1) or iterable, optional, default value 0.9): the (outer)
momentum factor, i.e., the fraction of the previous step retained.
beta1 – (real scalar in [0, 1), optional, default value 0.9): the exponential decay
rate for the first moment (mean) estimate of the gradient.
beta2 – (real scalar in [0, 1), optional, default value 0.999): the exponential decay
rate for the second raw moment (uncentered variance) estimate of the gradient.
offset – (real scalar > 0, optional, default value 1e-8): a small constant added to the
denominator to avoid division by zero and improve numerical stability.
callback – (callable, optional, default value None): a function called at each iteration
with the optimizer instance (and callback_args) as arguments; it can raise
StopIteration to interrupt the optimization.
callback_args – (tuple, optional, default value ()): additional positional arguments passed
to the callback at each call.
shuffle – (boolean, optional, default value True): whether to shuffle the order of the
mini batches at the beginning of each epoch.
random_state – (integer scalar or None, optional, default value None): seed for the random
number generator, for reproducibility.
verbose – (boolean or integer, optional, default value False): print details about each
iteration if True (or every verbose epochs if an integer), nothing otherwise.
Return an iterator that successively yields tuples of aligned mini batches
of size batch_size from the sliceable arrays returned by f.args(),
in random order (when shuffle is True) without replacement.
Returns:
an infinite iterator of mini batches (one tuple of aligned slices per step).
AMSGrad for the minimization of the provided function f.
It is a variant of Adam that, instead of the bias-corrected second moment,
uses the element-wise maximum of all the past second raw moment estimates to
scale the step. This keeps the effective learning rate non-increasing and
fixes a convergence issue of Adam.
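A sketch of the difference with respect to Adam, under the assumption that neither
moment is bias-corrected (as in the original formulation of AMSGrad; whether this
class corrects the first moment is not stated here). The function amsgrad_step is a
hypothetical name used only for this illustration.

```python
import math


def amsgrad_step(x, g, m, v, v_max, step_size=0.001, beta1=0.9, beta2=0.999,
                 offset=1e-8):
    """One AMSGrad update given the gradient g at x."""
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # element-wise maximum of all the second moment estimates seen so far:
    # it can only grow, so the per-coordinate scaling never increases
    v_max = [max(a, b) for a, b in zip(v_max, v)]
    x = [xi - step_size * mi / (math.sqrt(vm) + offset)
         for xi, mi, vm in zip(x, m, v_max)]
    return x, m, v, v_max
```

After a large gradient followed by small ones, v decays while v_max keeps the larger
value, which is what prevents the effective learning rate from growing back.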
Parameters:
f – the objective function.
x – ([n x 1] real column vector): the starting point of the algorithm.
batch_size – (integer scalar or None, optional, default value None): the size of the mini
batches used to estimate the gradient; if None the full sample is used.
eps – (real scalar, optional, default value 1e-6): the accuracy in the stopping
criterion: the algorithm is stopped when the norm of the gradient is less
than or equal to eps.
tol – (real scalar, optional, default value 1e-8): the tolerance used in the
optimality conditions of the Lagrangian dual (when f is a Lagrangian dual).
epochs – (integer scalar, optional, default value 1000): the maximum number of epochs
before the algorithm is stopped.
step_size – (real scalar > 0, callable or iterable, optional, default value 0.001): the
learning rate, i.e., the base size of the step taken along the search direction.
momentum_type – (string in {‘none’, ‘polyak’, ‘nesterov’}, optional, default value ‘none’):
the kind of (outer) momentum applied on top of the AMSGrad step.
momentum – (real scalar in [0, 1) or iterable, optional, default value 0.9): the (outer)
momentum factor, i.e., the fraction of the previous step retained.
beta1 – (real scalar in [0, 1), optional, default value 0.9): the exponential decay
rate for the first moment (mean) estimate of the gradient.
beta2 – (real scalar in [0, 1), optional, default value 0.999): the exponential decay
rate for the second raw moment (uncentered variance) estimate of the gradient.
offset – (real scalar > 0, optional, default value 1e-8): a small constant added to the
denominator to avoid division by zero and improve numerical stability.
callback – (callable, optional, default value None): a function called at each iteration
with the optimizer instance (and callback_args) as arguments; it can raise
StopIteration to interrupt the optimization.
callback_args – (tuple, optional, default value ()): additional positional arguments passed
to the callback at each call.
shuffle – (boolean, optional, default value True): whether to shuffle the order of the
mini batches at the beginning of each epoch.
random_state – (integer scalar or None, optional, default value None): seed for the random
number generator, for reproducibility.
verbose – (boolean or integer, optional, default value False): print details about each
iteration if True (or every verbose epochs if an integer), nothing otherwise.
Return an iterator that successively yields tuples of aligned mini batches
of size batch_size from the sliceable arrays returned by f.args(),
in random order (when shuffle is True) without replacement.
Returns:
an infinite iterator of mini batches (one tuple of aligned slices per step).
AdaMax for the minimization of the provided function f.
It is the variant of Adam based on the infinity norm: instead of the
exponentially decaying average of the squared gradients, it tracks an
exponentially weighted infinity norm of the gradients (a running maximum of
their absolute values, decayed by beta2 at each iteration) and uses it to
scale the bias-corrected first moment.
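The infinity-norm update can be sketched as follows. The function adamax_step is a
hypothetical name, and adding offset to the denominator is an assumption based on
the offset parameter documented for this class; t is the 1-based iteration counter.

```python
def adamax_step(x, g, m, u, t, step_size=0.002, beta1=0.9, beta2=0.999, offset=1e-8):
    """One AdaMax update given the gradient g at x (t starts from 1)."""
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    # exponentially weighted infinity norm: decayed past maximum vs. |g|
    u = [max(beta2 * ui, abs(gi)) for ui, gi in zip(u, g)]
    # only the first moment needs the bias correction
    corrected_step = step_size / (1 - beta1 ** t)
    x = [xi - corrected_step * mi / (ui + offset) for xi, mi, ui in zip(x, m, u)]
    return x, m, u
```

Unlike the second moment of Adam, u needs no bias correction because the maximum is
not pulled towards its zero initial value.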
Parameters:
f – the objective function.
x – ([n x 1] real column vector): the starting point of the algorithm.
batch_size – (integer scalar or None, optional, default value None): the size of the mini
batches used to estimate the gradient; if None the full sample is used.
eps – (real scalar, optional, default value 1e-6): the accuracy in the stopping
criterion: the algorithm is stopped when the norm of the gradient is less
than or equal to eps.
tol – (real scalar, optional, default value 1e-8): the tolerance used in the
optimality conditions of the Lagrangian dual (when f is a Lagrangian dual).
epochs – (integer scalar, optional, default value 1000): the maximum number of epochs
before the algorithm is stopped.
step_size – (real scalar > 0, callable or iterable, optional, default value 0.002): the
learning rate, i.e., the base size of the step taken along the search direction.
momentum_type – (string in {‘none’, ‘polyak’, ‘nesterov’}, optional, default value ‘none’):
the kind of (outer) momentum applied on top of the AdaMax step.
momentum – (real scalar in [0, 1) or iterable, optional, default value 0.9): the (outer)
momentum factor, i.e., the fraction of the previous step retained.
beta1 – (real scalar in [0, 1), optional, default value 0.9): the exponential decay
rate for the first moment (mean) estimate of the gradient.
beta2 – (real scalar in [0, 1), optional, default value 0.999): the exponential decay
rate for the exponentially weighted infinity norm of the gradient.
offset – (real scalar > 0, optional, default value 1e-8): a small constant added to the
denominator to avoid division by zero and improve numerical stability.
callback – (callable, optional, default value None): a function called at each iteration
with the optimizer instance (and callback_args) as arguments; it can raise
StopIteration to interrupt the optimization.
callback_args – (tuple, optional, default value ()): additional positional arguments passed
to the callback at each call.
shuffle – (boolean, optional, default value True): whether to shuffle the order of the
mini batches at the beginning of each epoch.
random_state – (integer scalar or None, optional, default value None): seed for the random
number generator, for reproducibility.
verbose – (boolean or integer, optional, default value False): print details about each
iteration if True (or every verbose epochs if an integer), nothing otherwise.
Return an iterator that successively yields tuples of aligned mini batches
of size batch_size from the sliceable arrays returned by f.args(),
in random order (when shuffle is True) without replacement.
Returns:
an infinite iterator of mini batches (one tuple of aligned slices per step).
AdaGrad (Adaptive Gradient) for the minimization of the provided function f.
It adapts the learning rate to each coordinate by dividing the step by the
square root of the sum of the squares of all the past gradients, so that
frequently updated parameters receive smaller steps and rarely updated ones
larger steps.
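A minimal sketch of this accumulation, assuming (as the description of offset for
this class suggests) that the small constant is added to the accumulated squared
gradients before taking the square root. The function adagrad_step is a hypothetical
name used only for this illustration.

```python
import math


def adagrad_step(x, g, sum_sq, step_size=1.0, offset=1e-8):
    """One AdaGrad update given the gradient g at x."""
    # the accumulator only grows: it sums the squares of all past gradients
    sum_sq = [si + gi * gi for si, gi in zip(sum_sq, g)]
    x = [xi - step_size * gi / math.sqrt(si + offset)
         for xi, gi, si in zip(x, g, sum_sq)]
    return x, sum_sq
```

Feeding the same gradient twice shows the shrinking step: with a gradient of 3 the
first move is about 3 / sqrt(9) = 1, the second about 3 / sqrt(18) ≈ 0.71.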
Parameters:
f – the objective function.
x – ([n x 1] real column vector): the starting point of the algorithm.
batch_size – (integer scalar or None, optional, default value None): the size of the mini
batches used to estimate the gradient; if None the full sample is used.
eps – (real scalar, optional, default value 1e-6): the accuracy in the stopping
criterion: the algorithm is stopped when the norm of the gradient is less
than or equal to eps.
tol – (real scalar, optional, default value 1e-8): the tolerance used in the
optimality conditions of the Lagrangian dual (when f is a Lagrangian dual).
epochs – (integer scalar, optional, default value 1000): the maximum number of epochs
before the algorithm is stopped.
step_size – (real scalar > 0, callable or iterable, optional, default value 1.0): the
learning rate, i.e., the base size of the step taken along the negative gradient.
offset – (real scalar > 0, optional, default value 1e-8): a small constant added to the
accumulated squared gradients to avoid division by zero.
callback – (callable, optional, default value None): a function called at each iteration
with the optimizer instance (and callback_args) as arguments; it can raise
StopIteration to interrupt the optimization.
callback_args – (tuple, optional, default value ()): additional positional arguments passed
to the callback at each call.
shuffle – (boolean, optional, default value True): whether to shuffle the order of the
mini batches at the beginning of each epoch.
random_state – (integer scalar or None, optional, default value None): seed for the random
number generator, for reproducibility.
verbose – (boolean or integer, optional, default value False): print details about each
iteration if True (or every verbose epochs if an integer), nothing otherwise.
Return an iterator that successively yields tuples of aligned mini batches
of size batch_size from the sliceable arrays returned by f.args(),
in random order (when shuffle is True) without replacement.
Returns:
an infinite iterator of mini batches (one tuple of aligned slices per step).
AdaDelta for the minimization of the provided function f.
It is an extension of AdaGrad that replaces the ever-growing sum of squared
gradients with an exponentially decaying average and, by also tracking an
exponentially decaying average of the squared updates, scales the step by the
ratio of these two running averages, removing the need for a manually tuned
global learning rate.
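The two running averages can be sketched as below. The function adadelta_step is a
hypothetical name, and treating step_size as a plain multiplier of the adaptive
update is an assumption of this sketch.

```python
import math


def adadelta_step(x, g, avg_sq_grad, avg_sq_update, step_size=1.0, decay=0.9,
                  offset=1e-6):
    """One AdaDelta update given the gradient g at x."""
    # decaying average of the squared gradients (replaces AdaGrad's growing sum)
    avg_sq_grad = [decay * a + (1 - decay) * gi * gi
                   for a, gi in zip(avg_sq_grad, g)]
    # update = -(RMS of past updates / RMS of gradients) * gradient
    update = [-math.sqrt(u + offset) / math.sqrt(a + offset) * gi
              for u, a, gi in zip(avg_sq_update, avg_sq_grad, g)]
    # decaying average of the squared updates, used at the next iteration
    avg_sq_update = [decay * u + (1 - decay) * d * d
                     for u, d in zip(avg_sq_update, update)]
    x = [xi + step_size * d for xi, d in zip(x, update)]
    return x, avg_sq_grad, avg_sq_update
```

The first steps are tiny because the average of the squared updates starts at zero,
so offset alone sets their initial scale.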
Parameters:
f – the objective function.
x – ([n x 1] real column vector): the starting point of the algorithm.
batch_size – (integer scalar or None, optional, default value None): the size of the mini
batches used to estimate the gradient; if None the full sample is used.
eps – (real scalar, optional, default value 1e-6): the accuracy in the stopping
criterion: the algorithm is stopped when the norm of the gradient is less
than or equal to eps.
tol – (real scalar, optional, default value 1e-8): the tolerance used in the
optimality conditions of the Lagrangian dual (when f is a Lagrangian dual).
epochs – (integer scalar, optional, default value 1000): the maximum number of epochs
before the algorithm is stopped.
step_size – (real scalar > 0, callable or iterable, optional, default value 1.0): the
learning rate, i.e., a scaling factor applied to the adaptive AdaDelta step; a value
of 1.0 leaves that step unscaled.
decay – (real scalar in [0, 1), optional, default value 0.9): the exponential decay
rate of the running averages of the squared gradients and of the squared updates.
offset – (real scalar > 0, optional, default value 1e-6): a small constant added to the
running averages to avoid division by zero and improve numerical stability.
callback – (callable, optional, default value None): a function called at each iteration
with the optimizer instance (and callback_args) as arguments; it can raise
StopIteration to interrupt the optimization.
callback_args – (tuple, optional, default value ()): additional positional arguments passed
to the callback at each call.
shuffle – (boolean, optional, default value True): whether to shuffle the order of the
mini batches at the beginning of each epoch.
random_state – (integer scalar or None, optional, default value None): seed for the random
number generator, for reproducibility.
verbose – (boolean or integer, optional, default value False): print details about each
iteration if True (or every verbose epochs if an integer), nothing otherwise.
Return an iterator that successively yields tuples of aligned mini batches
of size batch_size from the sliceable arrays returned by f.args(),
in random order (when shuffle is True) without replacement.
Returns:
an infinite iterator of mini batches (one tuple of aligned slices per step).
RMSProp for the minimization of the provided function f.
It divides the learning rate of each coordinate by the square root of an
exponentially decaying average of the squared gradients (the moving root mean
square), so that the effective step size adapts to the recent magnitude of the
gradients; an optional Polyak or Nesterov momentum can be applied on top.
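A minimal sketch of the update without the optional momentum. The function
rmsprop_step is a hypothetical name, and adding offset outside the square root is an
assumption of this sketch.

```python
import math


def rmsprop_step(x, g, avg_sq, step_size=0.001, decay=0.9, offset=1e-8):
    """One RMSProp update given the gradient g at x."""
    # exponentially decaying average of the squared gradients
    avg_sq = [decay * a + (1 - decay) * gi * gi for a, gi in zip(avg_sq, g)]
    # each coordinate is divided by its own moving root mean square
    x = [xi - step_size * gi / (math.sqrt(a) + offset)
         for xi, gi, a in zip(x, g, avg_sq)]
    return x, avg_sq
```

Because the average forgets old gradients, the effective step recovers when the
gradients become small again, unlike the ever-shrinking step of AdaGrad.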
Parameters:
f – the objective function.
x – ([n x 1] real column vector): the starting point of the algorithm.
step_size – (real scalar > 0, callable or iterable, optional, default value 0.001): the
learning rate, i.e., the base size of the step taken along the search direction.
momentum_type – (string in {‘none’, ‘polyak’, ‘nesterov’}, optional, default value ‘none’):
the kind of momentum applied on top of the RMSProp step.
momentum – (real scalar in [0, 1) or iterable, optional, default value 0.9): the momentum
factor, i.e., the fraction of the previous step retained in the current one.
batch_size – (integer scalar or None, optional, default value None): the size of the mini
batches used to estimate the gradient; if None the full sample is used.
eps – (real scalar, optional, default value 1e-6): the accuracy in the stopping
criterion: the algorithm is stopped when the norm of the gradient is less
than or equal to eps.
tol – (real scalar, optional, default value 1e-8): the tolerance used in the
optimality conditions of the Lagrangian dual (when f is a Lagrangian dual).
epochs – (integer scalar, optional, default value 1000): the maximum number of epochs
before the algorithm is stopped.
decay – (real scalar in [0, 1), optional, default value 0.9): the exponential decay
rate of the moving average of the squared gradients.
offset – (real scalar > 0, optional, default value 1e-8): a small constant added to the
denominator to avoid division by zero and improve numerical stability.
callback – (callable, optional, default value None): a function called at each iteration
with the optimizer instance (and callback_args) as arguments; it can raise
StopIteration to interrupt the optimization.
callback_args – (tuple, optional, default value ()): additional positional arguments passed
to the callback at each call.
shuffle – (boolean, optional, default value True): whether to shuffle the order of the
mini batches at the beginning of each epoch.
random_state – (integer scalar or None, optional, default value None): seed for the random
number generator, for reproducibility.
verbose – (boolean or integer, optional, default value False): print details about each
iteration if True (or every verbose epochs if an integer), nothing otherwise.
Return an iterator that successively yields tuples of aligned mini batches
of size batch_size from the sliceable arrays returned by f.args(),
in random order (when shuffle is True) without replacement.
Returns:
an infinite iterator of mini batches (one tuple of aligned slices per step).