

class mxnet.optimizer.Adam(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, lazy_update=True, **kwargs)

The Adam optimizer.

This class implements the optimizer described in Adam: A Method for Stochastic Optimization.

If the storage type of grad is row_sparse and lazy_update is True, lazy updates are applied as follows:

for row in grad.indices:
    rescaled_grad[row] = clip(grad[row] * rescale_grad + wd * weight[row], clip_gradient)
    m[row] = beta1 * m[row] + (1 - beta1) * rescaled_grad[row]
    v[row] = beta2 * v[row] + (1 - beta2) * (rescaled_grad[row]**2)
    weight[row] = weight[row] - learning_rate * m[row] / (sqrt(v[row]) + epsilon)

The lazy update modifies the mean (m) and variance (v) estimates only for the rows whose indices appear in the row_sparse gradient of the current batch, rather than for all rows. Compared with the standard update, this can substantially improve training throughput for some applications. However, its semantics differ slightly from the standard update, so it may produce different empirical results.
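The difference in semantics can be seen in a small plain-Python sketch. It is only an illustration of the per-row formulas above, not MXNet's implementation: the helper names (adam_row, lazy_step, standard_step, fresh_state) are hypothetical, the sparse gradient is modelled as a dict from row index to gradient row, and rescaling, weight decay and clipping are left out for brevity.

```python
import math

def adam_row(w_row, g_row, m_row, v_row, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Update one row in place using the per-row formulas
    (rescale_grad, wd and clip_gradient omitted for brevity)."""
    for j, g in enumerate(g_row):
        m_row[j] = beta1 * m_row[j] + (1 - beta1) * g
        v_row[j] = beta2 * v_row[j] + (1 - beta2) * g * g
        w_row[j] -= lr * m_row[j] / (math.sqrt(v_row[j]) + eps)

def lazy_step(weight, m, v, sparse_grad, lr):
    """Only rows present in sparse_grad are touched."""
    for row, g_row in sparse_grad.items():
        adam_row(weight[row], g_row, m[row], v[row], lr)

def standard_step(weight, m, v, sparse_grad, lr):
    """Every row is updated; rows absent from sparse_grad get a zero gradient."""
    for row in range(len(weight)):
        g_row = sparse_grad.get(row, [0.0] * len(weight[row]))
        adam_row(weight[row], g_row, m[row], v[row], lr)

def fresh_state():
    # Row 1 carries non-zero moments, as if it had been updated in an earlier batch.
    weight = [[1.0, 1.0], [1.0, 1.0]]
    m = [[0.0, 0.0], [0.2, 0.2]]
    v = [[0.0, 0.0], [0.04, 0.04]]
    return weight, m, v

grad = {0: [0.5, -0.5]}  # only row 0 appears in this batch

w_lazy, m_lazy, v_lazy = fresh_state()
lazy_step(w_lazy, m_lazy, v_lazy, grad, lr=0.1)

w_std, m_std, v_std = fresh_state()
standard_step(w_std, m_std, v_std, grad, lr=0.1)

print(w_lazy[1], m_lazy[1])  # row 1 is left untouched by the lazy update
print(w_std[1], m_std[1])    # row 1 still moves: its stored moments decay and drive a step
```

Row 0 ends up identical in both variants. The two only diverge on rows that are missing from the batch, which is where the lazy update saves work.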

Otherwise, standard updates are applied as follows:

rescaled_grad = clip(grad * rescale_grad + wd * weight, clip_gradient)
m = beta1 * m + (1 - beta1) * rescaled_grad
v = beta2 * v + (1 - beta2) * (rescaled_grad**2)
weight = weight - learning_rate * m / (sqrt(v) + epsilon)
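To make the arithmetic concrete, here is a minimal plain-Python sketch of one standard step on a two-element weight. It follows the formulas exactly as written above and adds nothing beyond them (in particular, no bias-correction terms from the Adam paper). The function names clip and adam_step are hypothetical, and clip is assumed to clamp each value to the range [-clip_gradient, clip_gradient].

```python
import math

def clip(x, bound):
    """Clamp x to [-bound, bound]; a bound of None disables clipping (assumption)."""
    if bound is None:
        return x
    return max(-bound, min(bound, x))

def adam_step(weight, grad, m, v, learning_rate=0.001, beta1=0.9, beta2=0.999,
              epsilon=1e-8, rescale_grad=1.0, wd=0.0, clip_gradient=None):
    """Apply one dense update in place to lists of floats."""
    for i in range(len(weight)):
        g = clip(grad[i] * rescale_grad + wd * weight[i], clip_gradient)
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        weight[i] -= learning_rate * m[i] / (math.sqrt(v[i]) + epsilon)

weight = [1.0, -2.0]
m = [0.0, 0.0]
v = [0.0, 0.0]

# The second gradient entry is zero, so m and v stay at zero for that element;
# epsilon keeps the division well defined and the weight does not move.
adam_step(weight, [0.5, 0.0], m, v, learning_rate=0.1)
print(weight, m, v)
```

For the first element, m becomes 0.05 and v becomes 0.00025, so the step is roughly 0.1 * 0.05 / sqrt(0.00025), or about 0.316.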

For details of the update algorithm, see adam_update.

This optimizer accepts the following parameters in addition to those accepted by Optimizer:

  • beta1 (float, optional) – Exponential decay rate for the first moment estimates.

  • beta2 (float, optional) – Exponential decay rate for the second moment estimates.

  • epsilon (float, optional) – Small value to avoid division by 0.

  • lazy_update (bool, optional) – Default is True. If True, lazy updates are applied if the storage types of weight and grad are both row_sparse.

__init__(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, lazy_update=True, **kwargs)

Initialize self. See help(type(self)) for accurate signature.


__init__([learning_rate, beta1, beta2, …])

Initialize self.

create_optimizer(name, **kwargs)

Instantiates an optimizer with a given name and kwargs.

create_state(index, weight)

Creates auxiliary state for a given weight.

create_state_multi_precision(index, weight)

Creates auxiliary state for a given weight, including FP32 high precision copy if original weight is FP16.


Registers a new optimizer.


Sets a new learning rate of the optimizer.


Sets an individual learning rate multiplier for each parameter.


[DEPRECATED] Sets lr scale.


Sets an individual weight decay multiplier for each parameter.

update(index, weight, grad, state)

Updates the given parameter using the corresponding gradient and state.

update_multi_precision(index, weight, grad, …)

Updates the given parameter using the corresponding gradient and state.
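The reason for keeping an FP32 copy alongside an FP16 weight, as create_state_multi_precision does, can be illustrated without MXNet. The sketch below is an assumption-laden toy, not the library's code: to_fp16 and to_fp32 are hypothetical helpers that round Python floats through the standard struct module, and step is a made-up update size chosen to be smaller than the FP16 spacing near 1.0.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

step = 1e-4  # hypothetical per-iteration update, below FP16 resolution near 1.0

# FP16 only: each result is rounded straight back to FP16.
w16 = to_fp16(1.0)
for _ in range(10):
    w16 = to_fp16(w16 - step)

# FP32 master copy: accumulate in FP32, cast to FP16 only for the working weight.
master = to_fp32(1.0)
for _ in range(10):
    master = to_fp32(master - step)
w16_from_master = to_fp16(master)

print(w16)              # the ten small updates were rounded away
print(w16_from_master)  # the accumulated change survives the cast
```

With FP16 alone, each small update is lost to rounding. Accumulating in the higher-precision copy lets the updates add up until they are large enough to show in the FP16 weight.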