Zero grad on single parameter

Hi,
I found this code to zero the gradient on a single parameter:
a.grad.zero_()

But it is not working: AttributeError: 'NoneType' object has no attribute 'zero_'

I previously declared:

a = torch.tensor(-1., requires_grad=True)
a = nn.Parameter(a)
2 Likes

The gradient is only computed (and a.grad populated) once you call .backward() on a value that depends on a, e.g.:

loss.backward()
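
For illustration, a minimal sketch (the values are just an example):

import torch

a = torch.tensor(-1., requires_grad=True)
print(a.grad)         # None: no backward pass has run yet

loss = (a * 2).sum()  # any scalar value that depends on a
loss.backward()       # populates a.grad
print(a.grad)         # tensor(2.)
a.grad.zero_()        # works now, because a.grad exists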
1 Like

Hi,

Before you call .backward(), the gradient of every tensor with requires_grad=True is None.
So in the case you posted, you first have to compute a.grad (by calling backward) and only then can you zero_() it.
opt.zero_grad() checks for this explicitly:

    def zero_grad(self):
        r"""Clears the gradients of all optimized :class:`torch.Tensor` s."""
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    p.grad.detach_()
                    p.grad.zero_()
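
For a single parameter you can apply the same guard; a minimal sketch:

if a.grad is not None:   # the gradient only exists after a backward pass
    a.grad.zero_()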

Hope this helps.

4 Likes

Thanks for the feedback; the gradient was indeed None, and that caused the error.
Here is the full code of the problem.

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

n = 70 # num of points

# x is a tensor
x = torch.linspace(0, 10, steps=n)
k = torch.tensor(2.5)

# y is a tensor
y = k*x + 5*torch.rand(n)

# loss function
def mse(y_hat, y): return ((y_hat-y)**2).mean()

a = torch.tensor(-1., requires_grad=True)
a = nn.Parameter(a)

y_hat = a*x


plt.scatter(x, y);
plt.scatter(x,y_hat.detach());

lr = 0.1

def update_a():    
    global a
    loss = mse(y_hat, y) 
    print(loss) 
    loss.backward(a)
    print(a.grad)
    a.grad.zero_()
    a = a - lr * a.grad
   
    

for t in range(10): 
    a = update_a()

The plan is to learn the slope: it starts at -1 and should end up around 2.5.


The green dots should end up closer to the blue dots.

I use the mse loss function, and I expect to learn the parameter a.

Hi,

I just ran your code snippet, and there are a few problems: it raises a RuntimeError about calling .backward() on a computation graph that has already been freed, and a.grad stays None after the first iteration, so nothing is learned. Additionally, your learning rate is too high.
I wrote a snippet based on yours, and it works now.

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

n = 70  # num of points
# x is a tensor
x = torch.linspace(0, 10, steps=n)
k = torch.tensor(2.5)
# y is a tensor
y = k * x + 5 * torch.rand(n)

# loss function
def mse(y_hat, y):
    return ((y_hat - y) ** 2).mean()

a = torch.tensor(-1., requires_grad=True)
a = nn.Parameter(a)
lr = 0.005

for t in range(10):
    print(t, ':', a.grad, a.requires_grad)
    if a.grad is not None:
        a.grad.zero_()
    y_hat = a * x
    loss = mse(y_hat, y)
    loss.backward()
    #print(t, 'after backward:', a.grad, a.requires_grad, a.is_leaf)
    a = (a.data - lr * a.grad).requires_grad_(True)
    #print(t, 'after update:', a.requires_grad, a.is_leaf)

plt.scatter(x, y)
plt.scatter(x, y_hat.detach())
plt.show()
  • You should compute y_hat inside the loop; otherwise you get the RuntimeError mentioned above, because the graph is freed after the first backward().
  • If you reassign a directly in each iteration, a will only have a grad in the first iteration, since the reassigned a is a non-leaf tensor and its grad stays None (see the sketch below).
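
A minimal sketch of the second point (the numbers are just an example):

import torch

a = torch.tensor(-1., requires_grad=True)
print(a.is_leaf)          # True: backward() will populate a.grad

loss = (a * 3).sum()
loss.backward()
a = a - 0.1 * a.grad      # reassigning creates a new, non-leaf tensor
print(a.is_leaf)          # False: from now on a.grad stays None after backward()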


EDIT: oh, I forgot to zero a.grad; I have corrected it in the snippet.

3 Likes

Phenomenal update. You anticipated the problems I had and helped me.
I learned much from your code.
I noticed that this also works:

    with torch.no_grad():
        a.sub_(lr * a.grad)

once it replaces this line:
a = (a.data - lr * a.grad).requires_grad_(True)
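
In context, the loop body would then look roughly like this (a sketch based on the snippet above, with the same a, x, y, mse and lr):

for t in range(10):
    y_hat = a * x              # recompute inside the loop
    loss = mse(y_hat, y)
    loss.backward()
    with torch.no_grad():      # update without recording it in the autograd graph
        a.sub_(lr * a.grad)    # in-place update keeps a as a leaf tensor
    a.grad.zero_()             # a.grad exists here, so zeroing is safe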

Well, @MariosOreo beat me to it, but here’s my rewrite of your code anyway:

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

n = 70
x = torch.linspace(0, 10, steps=n)
k = 2.5

y = k*x + 5*torch.rand(n)
a = torch.tensor([-1.0, 0.0], requires_grad=True)

def y_hat():
    return a[0] * x + a[1]

lr = 0.01
for t in range(10): 
    loss = (y_hat() - y).pow(2.0).mean()
    loss.backward()
    a.data -= lr * a.grad.data
    print(f"{loss.item():7.3f} {a.data} {a.grad.data}") 
    a.grad.zero_()

plt.scatter(x, y)
plt.scatter(x, y_hat().detach().numpy())
plt.show()

You were zeroing the gradient for a before updating it.

The circularity error in calling backward() was due to the gradient of a being involved in the update. Using a.data instead avoids that.

There is no need to use the nn.Parameter wrapper. That allows the registration of parameter Tensors with a Module, which you are not using here.
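
For illustration, a minimal sketch of where nn.Parameter does matter (the Line module here is made up):

import torch
import torch.nn as nn

class Line(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(-1.))  # registered with the module

    def forward(self, x):
        return self.a * x

model = Line()
print(list(model.parameters()))  # includes a, so an optimizer can find it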

As @MariosOreo mentioned, the gradients of Tensors don’t exist until backward() is called.
This happens to everyone I think when they first try PyTorch because most of the examples show the gradient being zeroed at the beginning of the optimization loop. That’s why I zero them at the end of the loop. It makes the code a bit cleaner to not test for the existence of a.grad on every iteration.

Your constant k doesn’t need to be explicitly a Tensor. PyTorch will broadcast it when you use it.
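
For example:

x = torch.linspace(0, 10, steps=5)
print(2.5 * x)                   # same result as torch.tensor(2.5) * x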

I took the liberty of adding an extra coefficient to a for another example.

3 Likes

I think using with torch.no_grad() is a much better approach than using .data, which @ptrblck mentioned in this thread.

2 Likes

Yes, I agree. PyTorch is still evolving.

1 Like

a.grad.data.zero_() can work…