

class mxnet.optimizer.SGD(momentum=0.0, lazy_update=True, **kwargs)[source]

The SGD optimizer with momentum and weight decay.

If the storage type of grad is row_sparse and lazy_update is True, lazy updates are applied by:

for row in grad.indices:
    rescaled_grad[row] = lr * (rescale_grad * clip(grad[row], clip_gradient) + wd * weight[row])
    state[row] = momentum * state[row] + rescaled_grad[row]
    weight[row] = weight[row] - state[row]

The lazy update only updates the momentum for the weights whose row_sparse gradient indices appear in the current batch, rather than updating it for all indices. Compared with the standard update, it can substantially improve training throughput for some applications. However, its semantics differ slightly from the standard update, so it may lead to different empirical results.
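The lazy rule above can be sketched in plain Python to show which rows are touched. This is only an illustration of the pseudocode, not MXNet's implementation: `lazy_sgd_update`, the `clip` helper, and the dict-of-rows gradient format are assumptions made for this example.

```python
def clip(value, bound):
    """Clamp value to [-bound, bound]; a bound of None disables clipping."""
    if bound is None:
        return value
    return max(-bound, min(bound, value))


def lazy_sgd_update(weight, state, grad_rows, lr, momentum=0.0, wd=0.0,
                    rescale_grad=1.0, clip_gradient=None):
    """Hypothetical helper: apply the lazy update in place.

    grad_rows maps a row index to that row's gradient values and stands in
    for a row_sparse gradient. Rows without an entry are left untouched.
    """
    for row, grad_row in grad_rows.items():
        for j, g in enumerate(grad_row):
            step = lr * (rescale_grad * clip(g, clip_gradient)
                         + wd * weight[row][j])
            state[row][j] = momentum * state[row][j] + step
            weight[row][j] = weight[row][j] - state[row][j]


weight = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
state = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

# Only row 1 received a gradient in this batch.
lazy_sgd_update(weight, state, {1: [2.0, 2.0]}, lr=0.1, momentum=0.9)

print(weight)  # rows 0 and 2 keep their old values
print(state)   # momentum for rows 0 and 2 is not decayed
```

With these numbers, row 1 ends with state of about 0.65 and weight of about 0.35, while rows 0 and 2 are unchanged. Under the standard update with a zero gradient for rows 0 and 2, their state would still decay to 0.45 and their weights would move to 0.55, which is the difference in semantics described above.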

Otherwise, standard updates are applied by:

rescaled_grad = lr * (rescale_grad * clip(grad, clip_gradient) + wd * weight)
state = momentum * state + rescaled_grad
weight = weight - state

For details of the update algorithm see sgd_update and sgd_mom_update.
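As a worked illustration of the standard update, the following plain-Python sketch applies the three lines of pseudocode to a small dense parameter. The function name `sgd_momentum_step` and the list-based representation are assumptions for this example; the operators named above remain the reference for the actual behaviour.

```python
def sgd_momentum_step(weight, state, grad, lr, momentum=0.0, wd=0.0,
                      rescale_grad=1.0, clip_gradient=None):
    """Hypothetical helper: one standard SGD-with-momentum step, in place."""
    for i, g in enumerate(grad):
        if clip_gradient is not None:
            # clip(grad, clip_gradient): clamp to [-clip_gradient, clip_gradient]
            g = max(-clip_gradient, min(clip_gradient, g))
        rescaled_grad = lr * (rescale_grad * g + wd * weight[i])
        state[i] = momentum * state[i] + rescaled_grad
        weight[i] = weight[i] - state[i]


weight = [1.0, -2.0]
state = [0.0, 0.0]
grad = [0.5, 4.0]  # the second entry is clipped to 1.0 below

# Two steps with the same gradient show the momentum accumulating.
for _ in range(2):
    sgd_momentum_step(weight, state, grad, lr=0.1, momentum=0.9,
                      clip_gradient=1.0)

print(state)
print(weight)
```

After the first step the state is about [0.05, 0.1]; after the second it grows to about [0.095, 0.19], so the second step moves the weights further than the first even though the gradient did not change.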

This optimizer accepts the following parameters in addition to those accepted by Optimizer.

  • momentum (float, optional) – The momentum value.

  • lazy_update (bool, optional) – Default is True. If True, lazy updates are applied if the storage types of weight and grad are both row_sparse.

  • multi_precision (bool, optional) –

    Flag to control the internal precision of the optimizer:

    False: uses the same precision as the weights (default).
    True: makes an internal 32-bit copy of the weights and applies gradients
    in 32-bit precision even if the actual weights used in the model have lower precision.
    Turning this on can improve convergence and accuracy when training with float16.
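To see why a 32-bit copy can matter for float16 training, the sketch below emulates both settings with the standard-library `struct` module. It is a simplified model, not MXNet code: the constant step size, the helper names `to_fp16` and `to_fp32`, and the single scalar weight are assumptions chosen for illustration.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


step = 1e-4  # stands in for lr * gradient, assumed constant here

w_fp16_only = to_fp16(1.0)   # emulates multi_precision=False
w_master = to_fp32(1.0)      # emulates the internal 32-bit copy
w_fp16_view = to_fp16(w_master)

for _ in range(100):
    # Update applied directly in float16: the step is smaller than the
    # spacing between representable values near 1.0, so it rounds away.
    w_fp16_only = to_fp16(w_fp16_only - step)
    # Update applied to the float32 copy, then cast down for the model.
    w_master = to_fp32(w_master - step)
    w_fp16_view = to_fp16(w_master)

print(w_fp16_only)  # the float16-only weight never moves
print(w_fp16_view)  # the weight backed by a float32 copy has decreased
```

In this toy setting the float16-only weight stays at 1.0 after 100 steps, while the weight backed by a 32-bit copy ends close to 0.99. Real training is less extreme, but the same rounding effect is one reason the flag can help convergence.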

__init__(momentum=0.0, lazy_update=True, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.


__init__([momentum, lazy_update])

Initialize self.

create_optimizer(name, **kwargs)

Instantiates an optimizer with a given name and kwargs.

create_state(index, weight)

Creates auxiliary state for a given weight.

create_state_multi_precision(index, weight)

Creates auxiliary state for a given weight, including an FP32 high-precision copy if the original weight is FP16.


Registers a new optimizer.


Sets a new learning rate of the optimizer.


Sets an individual learning rate multiplier for each parameter.


[DEPRECATED] Sets lr scale.


Sets an individual weight decay multiplier for each parameter.

update(index, weight, grad, state)

Updates the given parameter using the corresponding gradient and state.

update_multi_precision(index, weight, grad, …)

Updates the given parameter using the corresponding gradient and state.