Before you call .backward(), the gradient of each tensor which requires_grad=True are all None.
Like the case you posted, you could calculate a.grad firstly and then zero_() its grad.
In opt.zero_grad() it declare explicity:

def zero_grad(self):
r"""Clears the gradients of all optimized :class:`torch.Tensor` s."""
for group in self.param_groups:
for p in group['params']:
if p.grad is not None:
p.grad.detach_()
p.grad.zero_()

Thanks for the feedback, it was None for sure and this caused the error.
Here is the full code of the problem.

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
n = 70 # num of points
# x is a tensor
x = torch.linspace(0, 10, steps=n)
k = torch.tensor(2.5)
# y is a tensor
y = k*x + 5*torch.rand(n)
# loss function
def mse(y_hat, y): return ((y_hat-y)**2).mean()
a = torch.tensor(-1., requires_grad=True)
a = nn.Parameter(a)
y_hat = a*x
plt.scatter(x, y);
plt.scatter(x,y_hat.detach());
lr = 0.1
def update_a():
global a
loss = mse(y_hat, y)
print(loss)
loss.backward(a)
print(a.grad)
a.grad.zero_()
a = a - lr * a.grad
for t in range(10):
a = update_a()

Plan is to learn the curve direction that was originally -1 and it should be 2.5.

I just run your code snipper, but there is something wrong, it raised RuntimeError about call .backward on the computation graph which had been freed. And a.grad is still None except the first iteration, so it learned nothing. Additionally, your learning_rate is too high.
I write a snippet based on yours, and it works now.

n = 70 # num of points
# x is a tensor
x = torch.linspace(0, 10, steps=n)
k = torch.tensor(2.5)
# y is a tensor
y = k * x + 5 * torch.rand(n)
# loss function
def mse(y_hat, y):
return ((y_hat - y) ** 2).mean()
a = torch.tensor(-1., requires_grad=True)
a = nn.Parameter(a)
lr = 0.005
for t in range(10):
print(t, ':', a.grad, a.requires_grad)
if a.grad is not None:
a.grad.zero_()
y_hat = a * x
loss = mse(y_hat, y)
loss.backward()
#print(t, 'after backward:', a.grad, a.requires_grad, a.is_leaf)
a = (a.data - lr * a.grad).requires_grad_(True)
#print(t, 'after update:', a.requires_grad, a.is_leaf)
plt.scatter(x, y)
plt.scatter(x, y_hat.detach())
plt.show()

You should calculate y_hat in the loop, otherwise, there will raise a RuntimeError mentioned above.

If you assign a directly in each iteration, a will only have grad in the first iteration since a will be a non_leaf Variable and its grad will be None.

EDIT: oh, I forgot to zero a.grad and I have correct it in the snippet.

Well, @MariosOreo beat me to it, but here’s my rewrite of your code anyway:

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
n = 70
x = torch.linspace(0, 10, steps=n)
k = 2.5
y = k*x + 5*torch.rand(n)
a = torch.tensor([-1.0, 0.0], requires_grad=True)
def y_hat():
return a[0] * x + a[1]
lr = 0.01
for t in range(10):
loss = (y_hat() - y).pow(2.0).mean()
loss.backward()
a.data -= lr * a.grad.data
print(f"{loss.item():7.3f} {a.data} {a.grad.data}")
a.grad.zero_()
plt.scatter(x, y)
plt.scatter(x, y_hat().detach().numpy())
plt.show()

You were zeroing the gradient for a before updating it.

The circularity error in calling backward() was due to the gradient of a being involved in the update. Using a.data instead avoids that.

There is no need to use the nn.Parameter wrapper. That allows the registration of parameter Tensors with a Module, which you are not using here.

As @MariosOreo mentioned, the gradients of Tensors don’t exist until backward() is called.
This happens to everyone I think when they first try PyTorch because most of the examples show the gradient being zeroed at the beginning of the optimization loop. That’s why I zero them at the end of the loop. It makes the code a bit cleaner to not test for the existence of a.grad on every iteration.

Your constant k doesn’t need to be explicitly a Tensor. PyTorch will broadcast it when you use it.

I took the liberty of adding an extra coefficient to a for another example.