How to compute loss.backward() in this case? "RuntimeError: there are no graph nodes that require computing gradients"

I’m not sure how to properly ask this question or if I’m missing something. I am training an RNN conditioned on images and text. During training, I take the generated output from the RNN model and then generate images from it. I then compute the loss, but I can’t run loss.backward() and I don’t know how to structure the code to get a usable loss. Here is what I mean:

    criterion = nn.L1Loss()
    for epoch in range(args.num_epochs):
        for i, (images, captions, lengths) in enumerate(data_loader):

            images = Variable(images, volatile=False)
            captions = Variable(captions, volatile=False)

            # Forward, backward and optimize
            lstm_decoder.zero_grad()
            resnet152_encoder.zero_grad()
            features = resnet152_encoder(images)
            outputs = lstm_decoder(features, captions, lengths)
            output_features = outputs_to_image_tensor(outputs)  # external image-generation step
            loss = criterion(output_features, images)
            loss.backward()
            optimizer.step()

With that code I get the error "RuntimeError: there are no graph nodes that require computing gradients". I think I understand why: in the line `output_features = outputs_to_image_tensor(outputs)` I apply a transformation to the outputs that loses the connection to the autograd graph. How can I do backpropagation here, i.e. how do I connect output_features back to outputs? What I essentially want is to update the weights that produced outputs based on the loss between the original images and the generated images. Does what I'm saying make sense?
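To make the failure concrete, here is a tiny toy example (not my actual pipeline; the numpy round trip just stands in for the external image-generation call). Once the output leaves autograd and comes back as a brand-new Variable, the loss has no path to anything with requires_grad=True:

    import torch
    from torch.autograd import Variable

    w = Variable(torch.randn(3, 3), requires_grad=True)   # stand-in for the model weights
    x = Variable(torch.randn(5, 3))
    out = x.mm(w)                                          # still inside the autograd graph

    # hypothetical stand-in for outputs_to_image_tensor(): leave autograd,
    # do something external, and come back as a brand-new Variable
    external = out.data.numpy() * 2.0
    rebuilt = Variable(torch.from_numpy(external))

    loss = rebuilt.abs().sum()
    loss.backward()   # RuntimeError: there are no graph nodes that require computing gradients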


I'm having the same issue: loss.backward() doesn't create .grad on the Variables that had requires_grad=True. The only cause I can identify is that I modified the output used for the loss (this is a necessary step, so I don't know any way around it), same as your "output_features = …" line.

If you solve it, it would be great if you could post the solution.

You should at least show what outputs_to_image_tensor() does. Maybe that step can be done directly inside the forward method of lstm_decoder without breaking the graph.

@alexis-jacq sorry, I didn't think it was useful to show that. It has to cross processes: the function makes external calls outside of Python and gets back images, which I then convert to tensors. I can show it if you think that would be useful. I have no idea how to fix this or how to model the problem, but as I understand it now, I don't think what I'm trying to do is possible. How can I create a relationship between output_features and outputs, since outputs is what is connected to the weights I need to update?

So is it the creation of new Variables that is causing the problem with the graph?

In my case, the modification equivalent to deepcode's is creating a new Variable (via the .scatter_() call in the code below) from the output of a matrix multiplication. If I don't do this, it works fine, of course. I get the same error as @deepcode, presumably because I have requires_grad=False on the new Variables. Still, w1 and w2 have requires_grad=True, so there should be graph nodes to compute gradients for.

Example Code:

    import torch
    from torch.autograd import Variable

    # shape, N, H, D_out, k, epochs and dtype are assumed to be defined elsewhere
    w1 = Variable(torch.randn(shape[0], shape[1]).type(dtype), requires_grad=True)
    w2 = Variable(torch.randn(shape[1], shape[2]).type(dtype), requires_grad=True)

    x = Variable(torch.randn(N, shape[0]).type(dtype), requires_grad=False)
    y = Variable(torch.randn(N, shape[2]).type(dtype), requires_grad=False)

    learning_rate = 0.001
    for t in range(epochs):
        l2 = x.mm(w1)
        # these next two lines cause problems: l2_sp is a brand-new leaf Variable
        # built from raw tensors, so it is detached from the graph
        topk, indices = torch.topk(l2, k[1])
        l2_sp = Variable(torch.zeros([N, H]).scatter_(1, indices.data, topk.data), requires_grad=False)
        l3 = l2.mm(w2)
        # same here
        topk, indices = torch.topk(l3, k[2])
        l3_sp = Variable(torch.zeros([N, D_out]).scatter_(1, indices.data, topk.data), requires_grad=False)

        loss = (l3_sp - y).abs().sum()
        loss.backward()

        w1.data -= learning_rate * w1.grad.data
        w2.data -= learning_rate * w2.grad.data

        w1.grad.data.zero_()
        w2.grad.data.zero_()

However, if I set the newly created Variables to requires_grad=True, the first error goes away, but then the weights I'm actually interested in backpropagating through, w1 and w2 (which originally had requires_grad=True), never get their gradients populated after loss.backward(). This shows up as an error on the line:

    ---> w1.data -= learning_rate * w1.grad.data

    AttributeError: 'NoneType' object has no attribute 'data'

I've also tried writing my own class derived from nn.Module and putting all of that inside forward(), but it still causes the same problem.
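To illustrate what I think is happening (a toy example, not my real code): wrapping a Variable's .data in a fresh Variable gives you a new leaf that shares the numbers but has no history, so backward() either complains that nothing requires gradients or stops at the new leaf and never reaches the original weights:

    import torch
    from torch.autograd import Variable

    w = Variable(torch.randn(4, 4), requires_grad=True)
    h = Variable(torch.randn(2, 4)).mm(w)               # h is connected to w through the graph

    detached = Variable(h.data, requires_grad=True)     # new leaf: same values, no history
    loss = detached.sum()
    loss.backward()

    print(detached.grad)    # populated
    print(w.grad)           # None: the gradient never reaches w through `detached`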

After more studying, it seems like implementing my own layer (a custom autograd Function) would be the easiest way to do what I'm trying to do, although I'm still not sure how I would implement it. I decided to use a simpler loss function for now, as what I wanted to do would take too long to compute anyway.
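In case it helps anyone who wants to go that route, here is a rough sketch of what such a layer might look like with the legacy autograd Function API this thread is using. The straight-through backward (passing the gradient through unchanged) is my own assumption, since the hard top-k indicator has zero gradient almost everywhere, so treat this as a starting point rather than a tested implementation:

    import torch
    from torch.autograd import Function

    class TopKIndicator(Function):
        """Forward: replace each row with a 0/1 indicator of its top-k entries.
        Backward: pass the incoming gradient straight through (one possible,
        biased choice; the exact op is not differentiable)."""

        def __init__(self, k):
            super(TopKIndicator, self).__init__()
            self.k = k

        def forward(self, input):
            # `input` arrives as a plain tensor here (legacy Function API)
            _, indices = torch.topk(input, self.k, dim=1)
            output = torch.zeros(input.size()).type_as(input)
            output.scatter_(1, indices, 1.0)
            return output

        def backward(self, grad_output):
            # straight-through: pretend the op was the identity
            return grad_output.clone()

    # usage (hypothetical names from the snippets above):
    # l2 = TopKIndicator(k[1])(x.mm(w1))
    # l3 = TopKIndicator(k[1])(l2.mm(w2))
    # loss = (l3 - y).abs().sum(); loss.backward()   # w1.grad / w2.grad get populated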

My solution was to assign the new l2_sp/l3_sp values (not wrapped in a Variable) to l2.data/l3.data. This avoids creating new Variables and keeps the graph intact.

@lmnt can I see what that code looks like?

What I wanted to do is set the top k values to 1's and the rest to zeros. Here I use topk() and scatter_() to do the work, take the resulting tensor, and assign it to l2.data (and likewise for l3). This keeps the graph intact, since l2 and l3 still carry the graph and gradient information needed for the subsequent backprop.

    # w1, w2, N, shape, dtype and epochs are as defined in the earlier snippet
    x = Variable(torch.randn(N, shape[0]).type(dtype), requires_grad=False)
    y = Variable(torch.randn(N, shape[2]).type(dtype), requires_grad=False)
    k = [25, 50]

    learning_rate = 0.001
    for t in range(epochs):
        l2 = x.mm(w1)
        # these three lines solve the problem: only l2.data is overwritten,
        # so l2 itself keeps its place in the graph
        topk, indices = torch.topk(l2, k[1])
        topk.data = torch.ones(topk.size())
        l2.data = torch.zeros(l2.size()).scatter_(1, indices.data, topk.data)

        l3 = l2.mm(w2)
        # same here
        topk, indices = torch.topk(l3, k[1])
        topk.data = torch.ones(topk.size())
        l3.data = torch.zeros(l3.size()).scatter_(1, indices.data, topk.data)

        loss = (l3 - y).abs().sum()
        loss.backward()

        w1.data -= learning_rate * w1.grad.data
        w2.data -= learning_rate * w2.grad.data

        w1.grad.data.zero_()
        w2.grad.data.zero_()
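
As a quick sanity check (assuming at least one iteration of the loop above has run): since the graph stays intact, the weight gradients are actually populated now instead of being None as before:

    # run this after loss.backward() inside the loop above
    assert w1.grad is not None and w1.grad.data.size() == w1.data.size()
    assert w2.grad is not None and w2.grad.data.size() == w2.data.size()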