backward() on a parameter with requires_grad=True still raises "does not have a grad_fn"

The following is a simplified version of my original code.

Basically, the code does two things:

(1) optimize the objective function using gradient descent

(2) control the gradients computed in (1) through a neural network

a = Variable(torch.Tensor([-2.]), requires_grad=True)
b = Variable(torch.Tensor([-2.]), requires_grad=True)
z = 2*x + 10  # x is the input data, defined elsewhere

nn = Control_Variate()  # a simple neural network, defined elsewhere



var_opt = torch.optim.Adam(nn.parameters(), lr=0.1)  # note: the params are the neural network weights
for i in range(1000):

    f_z = nn(a*x + b)

    der_shape = torch.ones(f_z.size())
    loss = torch.mean((f_z - z)**2)
    loss.backward(der_shape)

    gr_a = a.grad.clone()
    a.grad.data.zero_()
    gr_b = b.grad.clone()
    b.grad.data.zero_()

    g_a = Adam_Optim(gr_a)  # my own Adam optimizer, defined elsewhere
    g_b = Adam_Optim(gr_b)

    b.data.sub_(g_b)  # update a and b
    a.data.sub_(g_a)

    # now, do the variance control
    var_opt.zero_grad()
    var_loss = torch.mean((gr_a + gr_b - 5)**2)  # control the variance of the weight updates

    var_loss.backward()
    var_opt.step()

print("done")

The problem I have is that at

var_loss.backward()

it raises

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

But this backward() differentiates w.r.t. the neural network weights, and gr_a and gr_b are functions of my variables a and b, which I declared with requires_grad=True.

So I'm quite confused about why I receive this error message.

The problem is that you created two tensors that have no grad_fn by calling clone() on the .grad tensors: a.grad itself does not require grad, so its clone is a constant as far as autograd is concerned.
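You can check this directly; a minimal sketch (not from your code):

import torch

a = torch.tensor([-2.], requires_grad=True)
(a * 3).sum().backward()

print(a.grad.requires_grad)    # False: .grad does not track history
print(a.grad.clone().grad_fn)  # None: the clone is a constant for autograd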

Below is a simplified, working example:

a = torch.tensor([-2.], requires_grad=True)
b = torch.tensor([-2.], requires_grad=True)

for i in range(1):
    var_loss = a / b
    var_loss.backward()
    a.grad.zero_()
    b.grad.zero_()

print("done")

Now, to illustrate your implementation in the context of the simplified example above, consider the following:

a = torch.tensor([-2.], requires_grad=True)
b = torch.tensor([-2.], requires_grad=True)

for i in range(1):
    var_loss = a / b
    var_loss.backward()

    gr_b = b.grad.clone()
    gr_a = a.grad.clone()

    a.grad.zero_()
    b.grad.zero_()

    var_loss2 = gr_b / gr_a
    var_loss2.backward()

print("done")

This example now fails with the same error message you got, at var_loss2.backward(), due to the missing grad_fn. The fix is to make gr_b and gr_a require grad:

gr_b = b.grad.clone().requires_grad_(True)
gr_a = a.grad.clone().requires_grad_(True)
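For completeness, here is a minimal runnable version of the second example with that fix applied:

import torch

a = torch.tensor([-2.], requires_grad=True)
b = torch.tensor([-2.], requires_grad=True)

var_loss = a / b
var_loss.backward()

# clone the gradients and turn the clones into leaves that require grad
gr_b = b.grad.clone().requires_grad_(True)
gr_a = a.grad.clone().requires_grad_(True)

a.grad.zero_()
b.grad.zero_()

var_loss2 = gr_b / gr_a
var_loss2.backward()  # works now: var_loss2 has a grad_fn

print(gr_a.grad, gr_b.grad)

Note that the gradients of var_loss2 flow only into the clones; the clone cuts the connection to a and b, so this does not compute second derivatives.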

Thanks, this makes a lot of sense to me now.

Can I ask one more follow-up question?

In the code, the first time I call backward(),

loss.backward(der_shape)

I didn't use retain_graph=True.

The second time, when I call

var_loss.backward()

it should be trying to backward through the graph a second time, because differentiating gr_a and gr_b involves the parameters of the neural network. Why didn't it raise RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed.?


I suppose it's because there are two separate graphs, i.e., the first graph for

loss = torch.mean((f_z-z)**2)

which was “done” (in the sense of “completed”) by calling

loss.backward(der_shape)

and then you build a second graph,

gr_a = a.grad.clone()
gr_b = b.grad.clone()
var_loss = torch.mean((gr_a+gr_b-5)**2)

that was then evaluated by var_loss.backward().

For example, the following would work:

a = torch.tensor([-2.], requires_grad=True)
b = torch.tensor([-2.], requires_grad=True)

var_loss = a/b
var_loss.backward()

var_loss = a/b
var_loss.backward()

but the following would cause the error you mentioned:

a = torch.tensor([-2.], requires_grad=True)
b = torch.tensor([-2.], requires_grad=True)

var_loss = a/b
var_loss.backward()

var_loss.backward()

Note that in the last example I don't redefine var_loss, and hence no new graph is built.
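If you do want to backward through the same graph twice, you can keep its buffers alive with retain_graph=True; a minimal sketch (not from the original posts):

import torch

a = torch.tensor([-2.], requires_grad=True)
b = torch.tensor([-2.], requires_grad=True)

var_loss = a / b
var_loss.backward(retain_graph=True)  # keep the graph's buffers
var_loss.backward()                   # second pass succeeds; gradients accumulate

print(a.grad)  # twice the single-pass gradient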

Thanks so much for the clarification.

Just one more thing I have noticed.

Let's say I change the code to the following.

The first time I call backward() on my regular loss function:

loss = torch.mean((f_z-z)**2)
loss.backward(der_shape, retain_graph=True)  # first backward() call

Then I do the variance control:

    var_opt.zero_grad()
    var_loss = torch.mean((gr_a+gr_b-loss)**2)  # uses the first loss value, hence retain_graph=True above

Please ignore the crazy loss function that doesn't make sense; I'm simply playing around to get familiar with the torch system.

This makes sense; however, this time it NO longer requires me to do

gr_b = b.grad.clone().requires_grad_(True)
gr_a = a.grad.clone().requires_grad_(True)

Instead, the code works even with simply

    gr_a = a.grad.clone()
    a.grad.data.zero_()

    gr_b = b.grad.clone()
    b.grad.data.zero_()

But since the second backward() now incorporates the first loss function (i.e., ties the two loss functions together), PyTorch no longer requires the trick you taught me. This makes me think that in the second backward(), PyTorch only differentiates through the first loss function, while gr_a and gr_b are simply treated as constants and never as part of the variance loss (the second loss function)?

To be more clear, in the second backward():

    var_loss = torch.mean((gr_a+gr_b-loss))  # does not require gr_a, gr_b to have requires_grad=True
    # are gr_a and gr_b treated as constants here? how do I make backward() differentiate through gr_a and gr_b?

    #var_loss = torch.mean(gr_a+gr_b)  # must have gr_a, gr_b with requires_grad=True

Here is the full code:

a = Variable(torch.Tensor([-2.]), requires_grad=True)
b = Variable(torch.Tensor([-2.]), requires_grad=True)
z = 2*x + 10   # x is the input data, defined elsewhere
zz = 3*x + 10
nn = Control_Variate()  # a simple neural network

var_opt = torch.optim.Adam(nn.parameters(), lr=0.1)
for i in range(5000):

    f_z = nn(a*x + b)

    der_shape = torch.ones(f_z.size())
    loss = torch.mean((f_z - z)**2)
    loss.backward(der_shape, retain_graph=True)

    gr_a = a.grad.clone()
    a.grad.data.zero_()

    gr_b = b.grad.clone()
    b.grad.data.zero_()

    g_a, m_a, v_a = Adam_Optim(gr_a)
    g_b, m_b, v_b = Adam_Optim(gr_b)

    b.data.sub_(g_b)
    a.data.sub_(g_a)

    var_opt.zero_grad()

    var_loss = torch.mean((gr_a+gr_b-loss))  # does not require gr_a, gr_b to have requires_grad=True
    #var_loss = torch.mean(gr_a+gr_b)  # must have gr_a, gr_b with requires_grad=True

    var_loss.backward()
    var_opt.step()
    print(loss)

print("done")

I actually want the second backward() to differentiate through gr_a and gr_b, but now it looks like it doesn't, because gr_a and gr_b are treated as constants here rather than as variables.

I am not 100% sure that's what is happening, but I'd say that during the first round, where you set retain_graph=True and requires_grad=True, the graph is kept intact for the second round. And since it doesn't (need to) rebuild the graph for the second round, due to retain_graph=True, it would ignore any requires_grad setting.

But maybe @ptrblck and @smth can tell you more about the mechanics (or correct my [mis]interpretation).

Thanks for the reply. So if your interpretation is correct, gr_a and gr_b are actually treated as functions here, differentiation of each of them w.r.t. the neural network parameters does happen, and the parameter update does treat gr_a and gr_b as part of the loss function.

But with retain_graph=True in the first round of backward(), changing the second loss to var_loss = torch.mean(gr_a+gr_b) still gives the same error.

So I guess retain_graph=True in the first round doesn't really play much of a role here?

I'm not sure the code treats gr_a and gr_b right, since it seems they are constants for the backward pass.
At least when playing around with the code, I saw the same error @ElleryL explained, which indicates these tensors are not needed for the gradient computation.
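One way to check whether a tensor actually participates in a graph is torch.autograd.grad with allow_unused=True, which returns None for inputs the output does not depend on; a minimal sketch (not from the thread):

import torch

a = torch.tensor([-2.], requires_grad=True)
unused = torch.tensor([1.], requires_grad=True)  # requires grad, but not part of the graph

loss = (a * 3).mean()

grads = torch.autograd.grad(loss, (a, unused), allow_unused=True)
print(grads)  # (tensor([3.]), None): None marks the unused input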

I didn't understand all the code samples, so sorry if this misses the point, but is this basically what you are trying to do, @ElleryL:

x = torch.tensor([1.], requires_grad=False)
w = torch.tensor([2.], requires_grad=True)
y = torch.tensor([1.])

for epoch in range(10):
    w.grad = None
    loss_w = ((x*w - y)**2).mean()
    print('Loss w: ', loss_w)
    loss_w.backward()
    print('Grad w: ', w.grad)

    w_grad = w.grad.clone()
    w_grad.requires_grad_(True)

    loss_w_grad = (w_grad**2).mean()
    loss_w_grad.backward()
    print('Loss w_grad: ', loss_w_grad)
    print('Grad w_grad: ', w_grad.grad)

    # update using w.grad and the gradient of the second loss, w_grad.grad
    w.data.sub_(0.1 * (w.grad.data - w_grad.grad.data*0.1))
    print('W: ', w.data)
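Note that w_grad.grad in this example is the gradient of loss_w_grad w.r.t. the clone w_grad itself (here simply 2*w_grad); since the clone of w.grad carries no history back to w, it is not a second derivative of loss_w w.r.t. w.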

Thanks for the reply. Based on your code, if we change

loss_w_grad = (w_grad**2).mean()

to

loss_w_grad = ((w_grad+loss_w)**2).mean()

by introducing the first loss function, our code becomes:

loss_w.backward(retain_graph=True)
w_grad = w.grad.clone()
w_grad.requires_grad_(True)

loss_w_grad = ((w_grad+loss_w)**2).mean()
loss_w_grad.backward()

It makes me wonder whether loss_w_grad.backward() differentiates through w_grad, or rather treats it as a constant.

Because in my original code it would be:

z = function(x)  # `function` involves a parameter called phi
f_z = Net(z)  # neural net
der_shape = torch.ones(f_z.size())
f_z.backward(der_shape, retain_graph=True)

d_phi = phi.grad.clone()
d_phi.requires_grad_(True)
phi.grad.data.zero_()

g = f_z + d_phi

phi.data.sub_(0.01*g)  # update phi

variance_loss = (g**2).mean()

var_opt.zero_grad()  # here var_opt = torch.optim.Adam(Net().parameters(), lr=0.01)
variance_loss.backward()
var_opt.step()

d_phi.grad.data.zero_()

It turns out that the results are the SAME for both d_phi.requires_grad_(True) and d_phi.requires_grad_(False) (in the latter case you need to comment out d_phi.grad.data.zero_()). So I think variance_loss.backward() only backpropagates through f_z; the value of d_phi has no influence on the second backward().

My second backward() is meant to differentiate variance_loss w.r.t. the parameters of the neural net.
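That matches the behavior described above: variance_loss gets its grad_fn from f_z, while the cloned d_phi is a constant for autograd. If the goal is for the second backward() to genuinely differentiate through the gradient itself, cloning .grad cannot achieve it; the standard tool is torch.autograd.grad with create_graph=True, which builds a graph for the gradient computation. A minimal sketch on a toy scalar problem (not the original Control_Variate setup):

import torch

x = torch.tensor([1.])
w = torch.tensor([2.], requires_grad=True)
y = torch.tensor([1.])

loss_w = ((x*w - y)**2).mean()

# create_graph=True keeps grad_w connected to the graph, so it is differentiable w.r.t. w
grad_w, = torch.autograd.grad(loss_w, w, create_graph=True)
print(grad_w.grad_fn)  # not None

loss_w_grad = (grad_w**2).mean()
loss_w_grad.backward()  # differentiates through grad_w back to w
print(w.grad)           # tensor([8.]) here: d/dw (2(w-1))^2 = 8(w-1)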