Help with debugging a custom optimizer?

Hi everyone. I’m trying to implement the elastic averaging stochastic gradient descent (EASGD) algorithm from the paper Deep Learning with Elastic Averaging SGD, and I’m running into some trouble.

I’m subclassing PyTorch’s torch.optim.Optimizer class and referencing the official implementations of SGD and ASGD as a starting point.

The code that I have is:

import torch
import torch.optim as optim


class EASGD(optim.Optimizer):
    def __init__(self, params, lr, tau, alpha=0.001):
        self.alpha = alpha

        if lr < 0.0:
            raise ValueError(f"Invalid learning rate {lr}.")

        defaults = dict(lr=lr, alpha=alpha, tau=tau)
        super(EASGD, self).__init__(params, defaults)

    def __setstate__(self, state):
        super(EASGD, self).__setstate__(state)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            tau = group['tau']

            for t, p in enumerate(group['params']):
                x_normal = p.clone()
                x_tilde = p.clone()

                if p.grad is None:
                    continue

                # elastic update every tau steps
                if t % tau == 0:
                    p = p - self.alpha * (x_normal - x_tilde)
                    x_tilde = x_tilde + self.alpha * (x_normal - x_tilde)

                # plain SGD step
                d_p = p.grad.data
                p.data.add_(d_p, alpha=-group['lr'])

        return loss

When I run this code, I get the following warning:

/home/user/github/test-repo/easgd.py:50: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the gradient for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations.

Reading this PyTorch Discussion helped me understand the difference between leaf and non-leaf variables, but I’m not sure how to fix my code so it works properly.

Any tips on what to do or where to look are appreciated. Thanks.

This line of code:

p = p - self.alpha * (x_normal - x_tilde)

creates a non-leaf tensor, so the subsequent access to p.grad raises this warning.
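For example, you can reproduce the same warning outside the optimizer with a couple of lines (just an illustration):

import torch

x = torch.ones(3, requires_grad=True)  # leaf tensor, .grad is populated by backward()
y = x * 2                              # non-leaf tensor, result of an operation
y.sum().backward()
print(x.grad)   # tensor([2., 2., 2.])
print(y.grad)   # None, plus the same UserWarning about accessing .grad on a non-leaf tensor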
Based on your code, I think you should use a new variable name for this assignment and keep p pointing to the parameter stored in the optimizer.
Also, don’t use the .data attribute, as it can have unwanted side effects; update the parameter in-place inside a torch.no_grad() block (or with the @torch.no_grad() decorator on step) instead.
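Something along these lines might work as a starting point. It’s an untested sketch, not a drop-in for the paper’s distributed setup: I’m assuming tau is meant as an update period measured in steps, so I keep a per-parameter step counter and the center variable x_tilde in self.state rather than reusing the enumerate index.

import torch
import torch.optim as optim


class EASGD(optim.Optimizer):
    def __init__(self, params, lr, tau, alpha=0.001):
        if lr < 0.0:
            raise ValueError(f"Invalid learning rate {lr}.")

        defaults = dict(lr=lr, alpha=alpha, tau=tau)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            lr = group['lr']
            alpha = group['alpha']
            tau = group['tau']

            for p in group['params']:
                if p.grad is None:
                    continue

                state = self.state[p]
                if len(state) == 0:
                    state['step'] = 0
                    # center variable (my assumption: one x_tilde buffer per parameter)
                    state['x_tilde'] = p.detach().clone()

                x_tilde = state['x_tilde']

                # elastic update every tau steps, done in-place so p stays a leaf
                if state['step'] % tau == 0:
                    diff = p - x_tilde
                    p.add_(diff, alpha=-alpha)
                    x_tilde.add_(diff, alpha=alpha)

                # plain SGD step, also in-place and without touching .data
                p.add_(p.grad, alpha=-lr)
                state['step'] += 1

        return loss

Because p is only modified in-place, it keeps pointing to the leaf parameter the model owns, so p.grad stays valid and the warning should go away.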
