mx.nd.ftml.update

Description

The FTML optimizer, described in FTML - Follow the Moving Leader in Deep Learning, available at http://proceedings.mlr.press/v70/zheng17a/zheng17a.pdf, performs the following update:

\[\begin{split}
g_t = \nabla J(W_{t-1})\\
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\\
d_t = \frac{1 - \beta_1^t}{\eta_t} \left( \sqrt{ \frac{v_t}{1 - \beta_2^t} } + \epsilon \right)\\
\sigma_t = d_t - \beta_1 d_{t-1}\\
z_t = \beta_1 z_{t-1} + (1 - \beta_1^t) g_t - \sigma_t W_{t-1}\\
W_t = - \frac{z_t}{d_t}
\end{split}\]
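
For reference, the recursion above can be traced on plain numeric vectors. The sketch below is an illustrative base-R re-implementation of a single step, following the equations as printed above (it treats the learning rate lr as eta_t and omits weight decay, gradient rescaling, and clipping); it is not the operator's internal code.

    # One FTML step on plain numeric vectors, following the equations above.
    ftml_step <- function(weight, grad, d_prev, v_prev, z_prev,
                          lr, t, beta1 = 0.6, beta2 = 0.999, epsilon = 1e-8) {
      v     <- beta2 * v_prev + (1 - beta2) * grad^2                       # v_t
      d     <- (1 - beta1^t) / lr * (sqrt(v / (1 - beta2^t)) + epsilon)    # d_t, with eta_t = lr
      sigma <- d - beta1 * d_prev                                          # sigma_t
      z     <- beta1 * z_prev + (1 - beta1^t) * grad - sigma * weight      # z_t
      list(weight = -z / d, d = d, v = v, z = z)                           # W_t = -z_t / d_t
    }

    # One step starting from zero-initialized states.
    st <- ftml_step(weight = c(0.5, -0.3), grad = c(0.1, 0.02),
                    d_prev = c(0, 0), v_prev = c(0, 0), z_prev = c(0, 0),
                    lr = 0.01, t = 1)
    st$weight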

Arguments

weight : NDArray-or-Symbol
    Weight.

grad : NDArray-or-Symbol
    Gradient.

d : NDArray-or-Symbol
    Internal state d_t.

v : NDArray-or-Symbol
    Internal state v_t.

z : NDArray-or-Symbol
    Internal state z_t.

lr : float, required
    Learning rate.

beta1 : float, optional, default=0.6
    Generally close to 0.5.

beta2 : float, optional, default=0.999
    Generally close to 1.

epsilon : double, optional, default=1e-08
    Small constant added to prevent division by zero.

t : int, required
    Number of the update; the time step t in the equations above.

wd : float, optional, default=0
    Weight decay augments the objective function with a regularization term that penalizes large weights; the penalty scales with the square of each weight's magnitude.

rescale.grad : float, optional, default=1
    Rescale the gradient as grad = rescale_grad * grad.

clip.grad : float, optional, default=-1
    Clip the gradient to the range [-clip_gradient, clip_gradient]: grad = max(min(grad, clip_gradient), -clip_gradient). If clip_gradient <= 0, gradient clipping is turned off.
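
A minimal usage sketch, assuming the mxnet R package is loaded; the shapes, random inputs, and hyperparameter values are placeholders chosen for illustration. The arrays d, v, and z hold the optimizer's internal state and are carried from one call to the next.

    library(mxnet)

    shape  <- c(3, 2)
    weight <- mx.nd.array(matrix(rnorm(6), nrow = 3))
    grad   <- mx.nd.array(matrix(rnorm(6), nrow = 3))

    # Internal optimizer states, typically initialized to zeros before the first update.
    d <- mx.nd.zeros(shape)
    v <- mx.nd.zeros(shape)
    z <- mx.nd.zeros(shape)

    # Apply one FTML update (t = 1) and obtain the updated weights.
    new_weight <- mx.nd.ftml.update(weight = weight, grad = grad,
                                    d = d, v = v, z = z,
                                    lr = 0.01, t = 1,
                                    beta1 = 0.6, beta2 = 0.999, epsilon = 1e-8,
                                    wd = 0, rescale.grad = 1, clip.grad = -1)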