SGD Nesterov for Optim

Any idea why nesterov is not available under optim? Seems to be available under legacy here:

Does that mean that if I’d like to use Nesterov momentum, I have to modify PyTorch’s optim.SGD myself?

I found a suggestion on GitHub by ajbrock to change it to:

from .optimizer import Optimizer, required

class SGD(Optimizer):
    """Implements stochastic gradient descent (optionally with momentum).

    Args:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float): learning rate
        momentum (float, optional): momentum factor (default: 0)
        weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
        dampening (float, optional): dampening for momentum (default: 0)
        nesterov (bool, optional): enables Nesterov momentum (default: False)

    Example:
        >>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
        >>> optimizer.zero_grad()
        >>> loss_fn(model(input), target).backward()
        >>> optimizer.step()
    """

    def __init__(self, params, lr=required, momentum=0, dampening=0,
                 weight_decay=0, nesterov=False):
        defaults = dict(lr=lr, momentum=momentum, dampening=dampening,
                        weight_decay=weight_decay, nesterov=nesterov)
        # Reject the flag if either requirement is violated (hence "or", not "and").
        if nesterov and (momentum <= 0 or dampening != 0):
            raise ValueError("Nesterov momentum requires a momentum and zero dampening")
        super(SGD, self).__init__(params, defaults)

    def step(self, closure=None):
        """Performs a single optimization step.

        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            weight_decay = group['weight_decay']
            momentum = group['momentum']
            dampening = group['dampening']
            nesterov = group['nesterov']

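            for p in group['params']:
                if p.grad is None:
                    continue  # this parameter received no gradient
                d_p = p.grad.data
                if weight_decay != 0:
                    d_p.add_(weight_decay, p.data)  # L2 penalty: d_p += weight_decay * p
                if momentum != 0:
                    param_state = self.state[p]
                    if 'momentum_buffer' not in param_state:
                        # first step: the buffer starts as a copy of the gradient
                        buf = param_state['momentum_buffer'] = d_p.clone()
                    else:
                        buf = param_state['momentum_buffer']
                        buf.mul_(momentum).add_(1 - dampening, d_p)
                    if nesterov:
                        d_p.add_(momentum, buf)  # d_p += momentum * buf
                    else:
                        d_p = buf

                p.data.add_(-group['lr'], d_p)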

        return loss
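
To see what the nesterov flag changes numerically, here is a minimal scalar sketch of the per-parameter update in step, written in plain Python so it can be traced by hand. The names sgd_step and run are hypothetical helpers for illustration, not part of PyTorch, and the constant gradient of 1.0 is an assumption chosen to keep the arithmetic simple.

```python
def sgd_step(p, grad, buf, lr, momentum=0.0, dampening=0.0,
             weight_decay=0.0, nesterov=False):
    """One scalar SGD update mirroring the loop body of step (illustrative only).

    Returns the new parameter value and the new momentum buffer.
    """
    d_p = grad
    if weight_decay != 0:
        d_p = d_p + weight_decay * p      # L2 penalty
    if momentum != 0:
        if buf is None:
            buf = d_p                     # first step: buffer starts as the gradient
        else:
            buf = momentum * buf + (1 - dampening) * d_p
        if nesterov:
            d_p = d_p + momentum * buf    # gradient plus momentum times the updated buffer
        else:
            d_p = buf                     # classical momentum: step along the buffer
    return p - lr * d_p, buf


def run(nesterov, steps=2):
    """Apply `steps` updates with a constant gradient of 1.0 (hypothetical toy setup)."""
    p, buf = 0.0, None
    for _ in range(steps):
        p, buf = sgd_step(p, grad=1.0, buf=buf, lr=0.1,
                          momentum=0.9, nesterov=nesterov)
    return p


print(run(nesterov=False))  # classical momentum
print(run(nesterov=True))   # Nesterov variant
```

With these toy values, the classical variant moves the parameter by 0.1 and then 0.19, while the Nesterov variant moves it by 0.19 and then 0.271, because each step adds momentum * buf on top of the current gradient.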

Yes, it’s going to be merged into master soon.


Why does the Nesterov method require momentum?
Is it implemented the way Nesterov formulated it, or in the simpler FISTA form?

Accelerated gradient descent is not itself a momentum method, but it has been shown to be closely related: its update rule can be rewritten as a momentum-like update. That is why the nesterov flag only makes sense with a nonzero momentum.
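
A rough sketch of that rewriting, in case it helps. The symbols are my own choice (θ for the parameters, v for the velocity, μ for the momentum factor, ε for the learning rate, φ for the look-ahead point), and I am assuming a constant learning rate and zero dampening:

```latex
\begin{align*}
% Classical momentum: gradient evaluated at the current parameters
v_{t+1} &= \mu v_t - \varepsilon \nabla f(\theta_t),
  & \theta_{t+1} &= \theta_t + v_{t+1} \\
% Nesterov momentum: gradient evaluated at the look-ahead point
v_{t+1} &= \mu v_t - \varepsilon \nabla f(\theta_t + \mu v_t),
  & \theta_{t+1} &= \theta_t + v_{t+1} \\
% Change of variables \phi_t = \theta_t + \mu v_t gives an update on \phi alone
\phi_{t+1} &= \theta_{t+1} + \mu v_{t+1}
  = \phi_t - \varepsilon \nabla f(\phi_t) + \mu v_{t+1}
\end{align*}
```

If the stored parameters are taken to be φ and the velocity is written as v = -ε · buf, the last line reads p -= lr * (d_p + momentum * buf), which is the form of the nesterov branch in the code posted earlier in this thread.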

I think that in the deep learning literature the method was introduced by Ilya Sutskever, and implementations since then have closely followed the original paper:

