SGD

class dragon.vm.torch.optim.SGD(
  params,
  lr=<object object>,
  momentum=0,
  dampening=0,
  weight_decay=0,
  nesterov=False,
  scale=1,
  clip_norm=0
)[source]

The optimizer that applies the SGD algorithm.

The following SGD algorithms are supported:

VanillaSGD, whose update is defined as:

\[\text{VanillaSGD}(g) = -\text{lr} * g \]

MomentumSGD [Polyak, 1964], whose update is defined as:

\[\text{MomentumSGD}(g) = -(\text{momentum} * m_{t-1} + \text{lr} * g) \]

NesterovSGD [Sutskever et al., 2013], whose update is defined as:

\[\text{NesterovSGD}(g) = -((1 + \text{momentum}) * m_{t} - \text{momentum} * m_{t-1}), \quad \text{where} \quad m_{t} = \text{momentum} * m_{t-1} + \text{lr} * g \]
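
For concreteness, the three update rules can be sketched in plain Python (a minimal illustration of the formulas above, not the library's implementation; ``g`` is the gradient and ``m_prev`` the previous momentum buffer):

def vanilla_sgd_update(g, lr):
    # VanillaSGD: step against the gradient, scaled by lr
    return -lr * g

def momentum_sgd_update(g, m_prev, lr, momentum):
    # MomentumSGD: m_t = momentum * m_{t-1} + lr * g, update = -m_t
    m = momentum * m_prev + lr * g
    return -m, m

def nesterov_sgd_update(g, m_prev, lr, momentum):
    # NesterovSGD: extrapolate along the momentum buffer
    m = momentum * m_prev + lr * g
    return -((1 + momentum) * m - momentum * m_prev), m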

You can use one of them by setting the defaults:

params = [torch.ones(1, requires_grad=True)]

# Set the ``lr`` only
vanilla_sgd = torch.optim.SGD(params, lr=0.1)

# Set the ``lr`` and ``momentum``
momentum_sgd = torch.optim.SGD(params, lr=0.1, momentum=0.9)

# Set the ``lr``, ``momentum`` and ``nesterov``
nesterov_sgd = torch.optim.SGD(params, lr=0.1, momentum=0.9, nesterov=True)

__init__

SGD.__init__(
  params,
  lr=<object object>,
  momentum=0,
  dampening=0,
  weight_decay=0,
  nesterov=False,
  scale=1,
  clip_norm=0
)[source]

Create an SGD optimizer.

Parameters:
  • params (Sequence[dragon.vm.torch.nn.Parameter]) – The parameters to optimize.
  • lr (float, required) – The initial value for \(\text{lr}\).
  • momentum (float, optional, default=0) – The initial value for \(\text{momentum}\).
  • dampening (float, optional, default=0) – The dampening for \(\text{momentum}\).
  • weight_decay (float, optional, default=0) – The L2 penalty factor applied to the parameters.
  • nesterov (bool, optional, default=False) – True to switch to the NesterovSGD optimizer.
  • scale (float, optional, default=1) – The scaling factor applied to the gradient.
  • clip_norm (float, optional, default=0) – The maximum L2 norm used to clip the gradient.
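
For example, the regularization and gradient-control arguments can be combined at construction time (a sketch following the signature above; ``model`` is a placeholder module):

model = torch.nn.Linear(3, 3)
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    weight_decay=0.0001,  # L2 penalty on the parameters
    clip_norm=5.0,        # clip gradients to a maximum L2 norm of 5
)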

Methods

accumulate

Optimizer.accumulate(momentum)[source]

Accumulate the gradient of params.

Call this method after each backward pass:

x = torch.ones(1, requires_grad=True)
optimizer = torch.optim.SGD([x], lr=0.1)
for epoch in range(2):
    for step in range(3):
        y = x + 1
        y.backward()
        # Zero the accumulation at the first step of each epoch
        optimizer.accumulate(momentum=1 if step > 0 else 0)
    optimizer.step()
print(x)  # 0.4
Parameters:
  • momentum (float, required) – The momentum applied to the accumulated value.
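
Conceptually, the accumulation behaves like ``acc = momentum * acc + grad``, which is why the example above passes ``0`` at the first step of each epoch to discard the stale buffer (an interpretation inferred from the example, not a statement about the internals):

acc = 0.0
grad = 1.0  # d(x + 1)/dx = 1
for epoch in range(2):
    for step in range(3):
        acc = (1 if step > 0 else 0) * acc + grad
print(acc)  # 3.0 per epoch; two steps of -lr * acc move x from 1.0 to 0.4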

add_param_group

Optimizer.add_param_group(param_group)[source]

Add a new param group into the optimizer.

The param_group should be a dict that optionally overrides the defaults:

# A group that redefines ``lr`` and ``weight_decay``
param_group1 = {
    'params': [],
    'lr': 0.01,
    'weight_decay': 0.0001,
}
# A group that inherits the defaults, scaled by ``lr_mult`` and ``decay_mult``
param_group2 = {
    'params': [],
    'lr_mult': 1,
    'decay_mult': 1,
}
Parameters:
  • param_group (dict) – The param group to add.
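
As an illustration (a sketch with two hypothetical modules, ``backbone`` and ``head``), a second group can be registered after construction:

backbone = torch.nn.Linear(3, 3)
head = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(backbone.parameters(), lr=0.01, momentum=0.9)
# The head uses a larger learning rate than the defaults
optimizer.add_param_group({'params': list(head.parameters()), 'lr': 0.1})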

step

Optimizer.step()[source]

Perform one update step.

Call this method after a backward pass:

x = torch.ones(1, 3, requires_grad=True)
optimizer = torch.optim.SGD([x], lr=0.1)
y = x + 1
y.backward()
optimizer.step()
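
For reference, the gradient of ``y = x + 1`` with respect to ``x`` is 1, so under the VanillaSGD rule above a single step with ``lr=0.1`` moves each entry of ``x`` from 1.0 to 0.9.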

zero_grad

Optimizer.zero_grad(reset=False)[source]

Set the gradient of params to zero.

This method is usually not necessary, as the gradients will be overwritten by the next computation.

However, if some gradients are not computed every time, remember to reset them before step(...):

m1 = torch.nn.Linear(3, 3)
m2 = torch.nn.Linear(3, 3)
optimizer = torch.optim.SGD([*m1.parameters(), *m2.parameters()], lr=0.1)
x = torch.ones(1, 3, requires_grad=True)
for i in range(10):
    x = m1(x)
    if i in (2, 4, 6):
        x += m2(x)
optimizer.zero_grad(reset=True)
x.backward()
optimizer.step()
Parameters:
  • reset (bool, optional, default=False) – True to reset the memory instead of zeroing.