AdamW

class dragon.optimizers.AdamW(
  lr=0.001,
  beta1=0.9,
  beta2=0.999,
  eps=1e-08,
  weight_decay=0.01,
  **kwargs
)[source]

The optimizer that applies the AdamW algorithm. [Loshchilov & Hutter, 2017].

The AdamW update is defined as:

\[\text{AdamW}(g, p) = -\text{lr} * (\frac{m_{t}}{\sqrt{v_{t}} + \epsilon} + \lambda p) \\ \quad \\ \text{where}\quad \begin{cases} m_{t} = \beta_{1} * m_{t-1} + (1 - \beta_{1}) * g \\ v_{t} = \beta_{2} * v_{t-1} + (1 - \beta_{2}) * g^{2} \end{cases} \]
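A minimal NumPy sketch of one update step, written directly from the equations above; the function name and signature are illustrative and not part of the dragon API:

import numpy as np

def adamw_step(p, g, m, v, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # First and second moment estimates, as in the definition above.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    # Decoupled weight decay: lambda * p is added outside the adaptive term.
    update = -lr * (m / (np.sqrt(v) + eps) + weight_decay * p)
    return p + update, m, v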

__init__

AdamW.__init__(
  lr=0.001,
  beta1=0.9,
  beta2=0.999,
  eps=1e-08,
  weight_decay=0.01,
  **kwargs
)[source]

Create an AdamW updater.

Parameters:
  • lr (float, optional, default=0.001) – The initial value for \(\text{lr}\).
  • beta1 (float, optional, default=0.9) – The initial value for \(\beta_{1}\).
  • beta2 (float, optional, default=0.999) – The initial value for \(\beta_{2}\).
  • eps (float, optional, default=1e-8) – The initial value for \(\epsilon\).
  • weight_decay (float, optional, default=0.01) – The initial value for \(\lambda\).
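A short construction sketch; the import path and defaults are taken from this page, while the surrounding script context is assumed:

import dragon

# Build the updater with the documented defaults.
opt = dragon.optimizers.AdamW(
    lr=0.001,
    beta1=0.9,
    beta2=0.999,
    eps=1e-8,
    weight_decay=0.01,
)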

Methods

apply_gradients

Optimizer.apply_gradients(grads_and_vars)[source]

Apply the gradients to variables.

Parameters:
  • grads_and_vars (Sequence[Sequence[dragon.Tensor]]) – The sequence of update pairs.
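A hedged sketch of the call shape. The names grads and variables are hypothetical placeholders for tensors produced elsewhere in the training loop, and the (gradient, variable) ordering inside each pair is an assumption, not stated on this page:

# `grads` and `variables` are hypothetical sequences of dragon.Tensor
# produced by the rest of the training loop.
grads_and_vars = [(g, v) for g, v in zip(grads, variables)]
opt.apply_gradients(grads_and_vars)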