Adam

class dragon.vm.torch.optim.Adam(
  params,
  lr=0.001,
  beta1=0.9,
  beta2=0.999,
  eps=1e-08,
  weight_decay=0,
  amsgrad=False,
  scale_gradient=1.0,
  clip_gradient=-1.0
)[source]

The optimizer that implements the Adam algorithm. [Kingma & Ba, 2014].

The Adam update is defined as:

\[\text{Adam}(g) = -\frac{\text{lr} * m_{t}}{\sqrt{v_{t}} + \epsilon} \\ \quad \\ \text{where}\quad \begin{cases} m_{t} = \beta_{1} * m_{t-1} + (1 - \beta_{1}) * g \\ v_{t} = \beta_{2} * v_{t-1} + (1 - \beta_{2}) * g^{2} \end{cases} \]
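For reference, the update can be written out directly. The following is a minimal NumPy sketch of a single Adam step; it mirrors the formula above (which omits the bias-correction terms used in the original paper) and is not Dragon's actual implementation:

import numpy as np

def adam_update(g, m, v, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # m_t = beta1 * m_{t-1} + (1 - beta1) * g
    m = beta1 * m + (1. - beta1) * g
    # v_t = beta2 * v_{t-1} + (1 - beta2) * g^2
    v = beta2 * v + (1. - beta2) * np.square(g)
    # Adam(g) = -lr * m_t / (sqrt(v_t) + eps)
    return -lr * m / (np.sqrt(v) + eps), m, v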

__init__

Adam.__init__(
  params,
  lr=0.001,
  beta1=0.9,
  beta2=0.999,
  eps=1e-08,
  weight_decay=0,
  amsgrad=False,
  scale_gradient=1.0,
  clip_gradient=-1.0
)[source]

Create an Adam optimizer.

Parameters:
  • params (Sequence[dragon.vm.torch.nn.Parameter]) – The parameters to optimize.
  • lr (float, optional, default=0.001) – The initial value for \(\text{lr}\).
  • beta1 (float, optional, default=0.9) – The initial value for \(\beta_{1}\).
  • beta2 (float, optional, default=0.999) – The initial value for \(\beta_{2}\).
  • eps (float, optional, default=1e-8) – The initial value for \(\epsilon\).
  • weight_decay (float, optional, default=0) – The factor of the L2 penalty.
  • amsgrad (bool, optional, default=False) – True to switch to the AMSGrad variant.
  • scale_gradient (float, optional, default=1.) – The factor to scale gradients.
  • clip_gradient (float, optional, default=-1.) – The norm threshold to clip gradients.
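For example (a minimal construction sketch; the module and hyperparameter values are placeholders, and dragon.vm.torch is assumed to be imported as torch with a PyTorch-style parameters() accessor on modules):

model = torch.nn.Linear(3, 3)
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.001,
    beta1=0.9,
    beta2=0.999,
    weight_decay=0.0001,
)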

Methods

accumulate_grad

Optimizer.accumulate_grad()[source]

Accumulate all gradients.

Call this method after a backward pass:

# ``optimizer`` is assumed to be an existing Adam instance
x = torch.ones(1, 3, requires_grad=True)
for i in range(10):
    y = x + 1
    y.backward()
    optimizer.accumulate_grad()
optimizer.step()

add_param_group

Optimizer.add_param_group(param_group)[source]

Add a new param group into the optimizer.

The param_group should be a dict that optionally overrides the defaults:

# A group that redefines ``lr`` and ``weight_decay``
param_group1 = {
    'params': [],
    'lr': 0.01,
    'weight_decay': 0.0001,
}
# A group that inherits the defaults while using multipliers
param_group2 = {
    'params': [],
    'lr_mult': 1.,
    'decay_mult': 1.,
}
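The groups can then be registered with add_param_group(...), assuming optimizer is an already-constructed Adam instance:

optimizer.add_param_group(param_group1)
optimizer.add_param_group(param_group2)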
Parameters:
  • param_group (Dict) – The param group to add.

step

Optimizer.step()[source]

Perform one step update.

Call this method after a backward pass:

x = torch.ones(1, 3, requires_grad=True)
y = x + 1
y.backward()
optimizer.step()

zero_grad

Optimizer.zero_grad(reset=False)[source]

Set all gradients to zero.

This method is usually not necessary, as the gradients will be overwritten by the next computation.

However, if some gradients are not computed in every iteration, remember to reset them before calling step(...):

# ``optimizer`` is assumed to optimize the parameters of ``m1`` and ``m2``
m1 = torch.nn.Linear(3, 3)
m2 = torch.nn.Linear(3, 3)
x = torch.ones(1, 3, requires_grad=True)
for i in range(10):
    x = m1(x)
    if i in (2, 4, 6):
        x += m2(x)
optimizer.zero_grad(reset=True)
x.backward()
Parameters:
  • reset (bool, optional, default=False) – True to reset the memory instead of zeroing.