Element 0 of tensors does not require grad and does not have a grad_fn at loss.backward()

Mrunalini_Ramnath · September 28, 2021, 3:12pm

Hey - I have a model that worked really well for segmentation. It would give pretty high IoU scores in 5-6 epochs. Two months later I tried to re-train the model on another dataset, and I’m now getting this error at loss.backward(). I have not changed the code in any way since then. I used a custom Dice Loss for the model. Setting loss.requires_grad=True got rid of the error, but now the model is more or less stuck at the same loss. I also tried re-training it on the original dataset and it has the same issue. What could be causing this?

tom · September 28, 2021, 3:28pm

Did you insert torch.no_grad by accident or cast through integer tensors somewhere?

One simple approach to debugging this kind of thing is to print various intermediates and check that they have the “grad_fn=…” information (or just print t.grad_fn or t.requires_grad for them). The operation that takes in things requiring gradients and outputs things that don’t will be the one that trips you up.
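As a minimal sketch of that check (the tensors and the thresholding step below are made up for illustration and are not taken from your code), you could walk through the intermediates like this:

```python
import torch

# Hypothetical stand-in for a model output: a tensor that requires gradients.
# As a leaf tensor it has requires_grad=True but grad_fn=None.
logits = torch.randn(4, 1, 8, 8, requires_grad=True)

probs = torch.sigmoid(logits)   # differentiable op: result has a grad_fn
mask = (probs > 0.5).float()    # comparison returns a bool tensor: the graph is cut here
score = mask.sum()              # no grad_fn, so score.backward() would raise the error

# Print the autograd state of each intermediate, in order of computation.
for name, t in [("logits", logits), ("probs", probs), ("mask", mask), ("score", score)]:
    print(f"{name}: requires_grad={t.requires_grad}, grad_fn={t.grad_fn}")
```

The first tensor in the list that reports requires_grad=False (here mask) points at the operation that detached the computation from the graph.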

Best regards

Thomas

Mrunalini_Ramnath · September 28, 2021, 3:35pm

I haven’t really made any changes to the code. I tried testing it with BCEWithLogitsLoss() (which had also given pretty good results previously) in case the issue was just with the Dice Loss, but the loss isn’t really decreasing and the model is stuck at pretty low IoU scores.

tom · September 28, 2021, 3:39pm

So just setting requires_grad on the loss, as you mentioned in the first post, silences the error but stops the backward at that tensor, so no gradients reach the parameters and the neural network is not updated. That would explain the loss being stuck.
So from your description you would want to find the operation breaking the autograd graph in order to get a meaningful backward pass again.
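To illustrate, here is a small sketch of that effect. The tiny Linear model and the torch.no_grad block are assumptions standing in for whatever is cutting the graph in your code:

```python
import torch

model = torch.nn.Linear(3, 1)   # hypothetical stand-in for the real network
x = torch.randn(5, 3)

with torch.no_grad():           # stands in for whatever is breaking the graph
    out = model(x)

loss = out.mean()               # requires_grad=False, no grad_fn
loss.requires_grad_(True)       # silences the error: loss becomes a leaf that requires grad
loss.backward()                 # runs, but there is nothing behind loss to backpropagate into

print(loss.grad)                # gradient of loss with respect to itself
print(model.weight.grad)        # None: the parameters never received a gradient
```

Since the parameters have no gradients, an optimizer step leaves them unchanged, so the loss cannot improve.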

Mrunalini_Ramnath · October 7, 2021, 2:39am

Hey, I’ve gone over everything but I’m still unsure what could be breaking the autograd graph, considering the exact same code worked perfectly a few weeks ago. I am running all of this on Colab. Would that make any difference? (Could an automatic version update on Colab be a factor? I really don’t understand what could be causing this, as the exact same code worked well on the same dataset a while ago.)