Table Of Contents
Table Of Contents


class mxnet.optimizer.Ftrl(lamda1=0.01, learning_rate=0.1, beta=1, **kwargs)[source]

The Ftrl optimizer.

Referenced from Ad Click Prediction: a View from the Trenches, available at

eta :
\[\eta_{t,i} = \frac{learningrate}{\beta+\sqrt{\sum_{s=1}^tg_{s,i}^2}}\]

The optimizer updates the weight by:

rescaled_grad = clip(grad * rescale_grad, clip_gradient)
z += rescaled_grad - (sqrt(n + rescaled_grad**2) - sqrt(n)) * weight / learning_rate
n += rescaled_grad**2
w = (sign(z) * lamda1 - z) / ((beta + sqrt(n)) / learning_rate + wd) * (abs(z) > lamda1)

If the storage types of weight, state and grad are all row_sparse, sparse updates are applied by:

for row in grad.indices:
    rescaled_grad[row] = clip(grad[row] * rescale_grad, clip_gradient)
    z[row] += rescaled_grad[row] - (sqrt(n[row] + rescaled_grad[row]**2) - sqrt(n[row])) * weight[row] / learning_rate
    n[row] += rescaled_grad[row]**2
    w[row] = (sign(z[row]) * lamda1 - z[row]) / ((beta + sqrt(n[row])) / learning_rate + wd) * (abs(z[row]) > lamda1)

The sparse update only updates the z and n for the weights whose row_sparse gradient indices appear in the current batch, rather than updating it for all indices. Compared with the original update, it can provide large improvements in model training throughput for some applications. However, it provides slightly different semantics than the original update, and may lead to different empirical results.

For details of the update algorithm, see ftrl_update.

This optimizer accepts the following parameters in addition to those accepted by Optimizer.

  • lamda1 (float, optional) – L1 regularization coefficient.

  • learning_rate (float, optional) – The initial learning rate.

  • beta (float, optional) – Per-coordinate learning rate correlation parameter.

__init__(lamda1=0.01, learning_rate=0.1, beta=1, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.


__init__([lamda1, learning_rate, beta])

Initialize self.

create_optimizer(name, **kwargs)

Instantiates an optimizer with a given name and kwargs.

create_state(index, weight)

Creates auxiliary state for a given weight.

create_state_multi_precision(index, weight)

Creates auxiliary state for a given weight, including FP32 high precision copy if original weight is FP16.


Registers a new optimizer.


Sets a new learning rate of the optimizer.


Sets an individual learning rate multiplier for each parameter.


[DEPRECATED] Sets lr scale.


Sets an individual weight decay multiplier for each parameter.

update(index, weight, grad, state)

Updates the given parameter using the corresponding gradient and state.

update_multi_precision(index, weight, grad, …)

Updates the given parameter using the corresponding gradient and state.