# Zero grad on single parameter

Hi,
I found this code to zero the gradients on a single parameter:
`a.grad.zero_()`

But it is not working: `AttributeError: 'NoneType' object has no attribute 'zero_'`

I previously declared:

``````a = torch.tensor(-1., requires_grad=True)
a = nn.Parameter(a)
``````

The gradient will be computed when you call:

``````a.backward()
``````

Hi,

Before you call `.backward()`, the gradient of every tensor with `requires_grad=True` is `None`.
In the case you posted, you need to compute `a.grad` first (by calling `backward()` on something that depends on `a`) and only then call `zero_()` on it.
`opt.zero_grad()` checks for this explicitly:

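``````def zero_grad(self):
    r"""Clears the gradients of all optimized :class:`torch.Tensor` s."""
    for group in self.param_groups:
        for p in group['params']:
            # p.grad stays None until backward() has run at least once
            if p.grad is not None:
                p.grad.detach_()
                p.grad.zero_()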
``````
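To make this concrete, here is a minimal sketch of the life cycle of `.grad` on a single tensor. The toy loss and the variable names `grad_before` and `grad_after` are my own, chosen only for illustration:

```python
import torch

# A leaf tensor that requires gradients; its .grad starts out as None.
a = torch.tensor(-1.0, requires_grad=True)
grad_before = a.grad            # None: backward() has not run yet

x = torch.linspace(0, 10, steps=5)
loss = (a * x).pow(2).mean()    # any scalar that depends on `a`
loss.backward()                 # this is what populates a.grad
grad_after = a.grad.clone()

# Guarded reset: only zero the gradient if it exists.
if a.grad is not None:
    a.grad.zero_()
```

Calling `a.grad.zero_()` before the `backward()` line would reproduce the `AttributeError` from the first post.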


Thanks for the feedback, it was None for sure and this caused the error.
Here is the full code of the problem.

``````import torch
import torch.nn as nn
import matplotlib.pyplot as plt

n = 70 # num of points

# x is a tensor
x = torch.linspace(0, 10, steps=n)
k = torch.tensor(2.5)

# y is a tensor
y = k*x + 5*torch.rand(n)

# loss function
def mse(y_hat, y): return ((y_hat-y)**2).mean()

a = torch.tensor(-1., requires_grad=True)
a = nn.Parameter(a)

y_hat = a*x

plt.scatter(x, y);
plt.scatter(x,y_hat.detach());

lr = 0.1

def update_a():
    global a
    loss = mse(y_hat, y)
    print(loss)
    loss.backward(a)
    a = a - lr * a.grad

for t in range(10):
    a = update_a()

``````

The plan is to learn the slope of the line, which starts at `-1` and should end up at `2.5`.

The green dots (predictions) should end up close to the blue dots (data).

I use the mse loss function, and I expect to learn the parameter `a`.

Hi,

I just ran your code snippet, and there are a few problems. It raises a RuntimeError about calling `.backward()` on a computation graph that has already been freed. Also, `a.grad` is `None` after the first iteration, so nothing is learned. Finally, your learning rate is too high.
I wrote a snippet based on yours, and it works now.

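``````import torch
import torch.nn as nn
import matplotlib.pyplot as plt

n = 70  # num of points
# x is a tensor
x = torch.linspace(0, 10, steps=n)
k = torch.tensor(2.5)
# y is a tensor
y = k * x + 5 * torch.rand(n)

# loss function
def mse(y_hat, y):
    return ((y_hat - y) ** 2).mean()

a = torch.tensor(-1., requires_grad=True)
a = nn.Parameter(a)
lr = 0.005

for t in range(10):
    # rebuild the graph on every iteration
    y_hat = a * x
    loss = mse(y_hat, y)
    loss.backward()
    # create a fresh leaf tensor, so the next backward() fills in its grad
    a = (a.data - lr * a.grad).requires_grad_(True)

plt.scatter(x, y)
plt.scatter(x, y_hat.detach())
plt.show()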
``````
• Compute `y_hat` inside the loop; otherwise the second `backward()` call raises the RuntimeError mentioned above.
• If you assign `a` directly in each iteration (`a = a - lr * a.grad`), `a` only has a grad in the first iteration, because after that it is a non-leaf tensor and its grad is `None`. EDIT: oh, I forgot to zero `a.grad`; I have corrected it in the snippet.
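
Both points can be checked in isolation. This is a small sketch with made-up data; the names `second_backward_failed` and `a_new` are only for illustration:

```python
import torch

x = torch.linspace(0, 10, steps=5)
y = 2.5 * x
a = torch.tensor(-1.0, requires_grad=True)

# 1) Reusing a graph: the first backward() frees the saved buffers,
#    so a second backward() on the same loss raises a RuntimeError.
loss = ((a * x - y) ** 2).mean()
loss.backward()
try:
    loss.backward()
    second_backward_failed = False
except RuntimeError:
    second_backward_failed = True

# 2) Rebinding `a` to the result of an operation produces a non-leaf
#    tensor. Autograd only stores .grad on leaf tensors by default.
a_new = a - 0.005 * a.grad
```

Here `a.is_leaf` is `True` while `a_new.is_leaf` is `False`, which is why `a.grad` comes back as `None` after the reassignment.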


Phenomenal update. You anticipated the problems I had and helped me.
I learned much from your code.
I noticed that this also works:

``````    with torch.no_grad():
``````

when it is used in place of this line:
`a = (a.data - lr * a.grad).requires_grad_(True)`
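
For anyone reading later, this is roughly what the loop could look like with `torch.no_grad()`. It is a sketch, not the exact code from this thread: the seed is an assumption so the run is repeatable, and plotting is left out.

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed for a repeatable run
n = 70
x = torch.linspace(0, 10, steps=n)
y = 2.5 * x + 5 * torch.rand(n)

a = torch.tensor(-1.0, requires_grad=True)
lr = 0.005

for t in range(10):
    loss = ((a * x - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        a -= lr * a.grad   # in-place update that autograd does not track
    a.grad.zero_()         # reset, otherwise gradients accumulate

print(a.item())
```

Because the update is in place, `a` stays the same leaf tensor for the whole loop, so `a.grad` keeps being filled in and has to be zeroed by hand.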

Well, @MariosOreo beat me to it, but here’s my rewrite of your code anyway:

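``````import torch
import torch.nn as nn
import matplotlib.pyplot as plt

n = 70
x = torch.linspace(0, 10, steps=n)
k = 2.5

y = k*x + 5*torch.rand(n)

# assumed definition and starting values: a[0] is the slope, a[1] the extra coefficient
a = torch.tensor([-1.0, 0.0], requires_grad=True)

def y_hat():
    return a[0] * x + a[1]

lr = 0.01
for t in range(10):
    loss = (y_hat() - y).pow(2.0).mean()
    loss.backward()
    a.data -= lr * a.grad  # update through .data, outside the graph
    a.grad.zero_()         # zero at the end of the loop, once a.grad exists

plt.scatter(x, y)
plt.scatter(x, y_hat().detach().numpy())
plt.show()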
``````

You were zeroing the gradient for `a` before updating it.

The circularity error in calling `backward()` was due to the gradient of `a` being involved in the update. Using `a.data` instead avoids that.

There is no need to use the `nn.Parameter` wrapper. That allows the registration of parameter Tensors with a `Module`, which you are not using here.
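
For comparison, here is a sketch of the case where `nn.Parameter` does matter. The `Line` module is hypothetical, and the data is made up:

```python
import torch
import torch.nn as nn

class Line(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        # Assigning an nn.Parameter as an attribute registers it with the module.
        self.a = nn.Parameter(torch.tensor(-1.0))

    def forward(self, x):
        return self.a * x

model = Line()
names = [name for name, _ in model.named_parameters()]

# The optimizer picks up the registered parameter via model.parameters().
opt = torch.optim.SGD(model.parameters(), lr=0.005)
x = torch.linspace(0, 10, steps=5)
y = 2.5 * x

loss = ((model(x) - y) ** 2).mean()
loss.backward()
opt.step()
opt.zero_grad()
```

With a plain tensor there is nothing to register, so the wrapper adds nothing in the original script.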

As @MariosOreo mentioned, the gradients of Tensors don’t exist until `backward()` is called.
I think this happens to everyone when they first try PyTorch, because most of the examples show the gradient being zeroed at the beginning of the optimization loop. That’s why I zero it at the end of the loop. It makes the code a bit cleaner to not test for the existence of `a.grad` on every iteration.
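
The reason the zeroing is needed at all is that `backward()` adds to `.grad` instead of overwriting it. A tiny sketch with made-up numbers:

```python
import torch

a = torch.tensor(-1.0, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])

(a * x).sum().backward()
first = a.grad.item()      # 6.0, the sum of x

(a * x).sum().backward()   # new graph, same leaf tensor
second = a.grad.item()     # 12.0, added on top of the previous value

a.grad.zero_()
(a * x).sum().backward()
third = a.grad.item()      # 6.0 again after the reset
```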

Your constant `k` doesn’t need to be explicitly a Tensor. PyTorch will broadcast it when you use it.
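
A quick check of that, as a sketch (the names `k_float` and `k_tensor` are only for illustration):

```python
import torch

x = torch.linspace(0, 10, steps=5)

k_float = 2.5                   # plain Python float
k_tensor = torch.tensor(2.5)    # 0-dim tensor, as in the original code

# Both forms scale every element of x and give the same result.
same = torch.equal(k_float * x, k_tensor * x)
print(same)
```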

I took the liberty of adding an extra coefficient to `a` for another example.


I think using `with torch.no_grad()` is a much better way than using `.data`, which ptrblck mentioned in this thread.


Yes, I agree. PyTorch is still evolving.
