Hey - I have a model that worked really well for segmentation. It would give pretty high IoU scores within 5-6 epochs. Two months later I tried to re-train the model on another dataset, and now I'm getting this error at loss.backward(). I have not changed the code in any way since then. I used a custom Dice loss for the model. Setting loss.requires_grad = True got rid of the error, but now the model is more or less stuck at the same loss. I also tried re-training it on the original dataset and it gives the same issues. What could be causing this?
Did you insert torch.no_grad by accident, or cast through integer tensors somewhere?
One simple approach to debugging this kind of thing is to print various intermediates and check that they carry the "grad_fn=…" information (or just print t.grad_fn or t.requires_grad for them). The operation that takes tensors requiring gradients and outputs tensors that don't is the one that breaks the graph.
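A minimal sketch of that debugging approach (the model, shapes, and the integer cast here are toy stand-ins, not your actual code):

```python
import torch
import torch.nn as nn

# Toy segmentation model; any model works for this check.
model = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1), nn.Sigmoid())
x = torch.randn(2, 1, 8, 8)
target = torch.randint(0, 2, (2, 1, 8, 8)).float()

pred = model(x)
# Still connected to the graph: requires_grad is True and grad_fn is set.
print(pred.requires_grad, pred.grad_fn)

# Example of a graph-breaking op: thresholding through an integer cast.
# The .long() cast silently detaches the tensor from autograd.
broken = pred.round().long().float()
print(broken.requires_grad, broken.grad_fn)  # False, None -- graph broken here

# A differentiable Dice-style loss keeps grad_fn all the way to the scalar.
inter = (pred * target).sum()
dice = 1 - (2 * inter + 1) / (pred.sum() + target.sum() + 1)
print(dice.requires_grad, dice.grad_fn is not None)
```

Walking your forward pass and loss with prints like these, the first tensor whose grad_fn comes out as None sits right after the offending operation.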
I haven’t really made any changes to the code. In case the issue was specific to the Dice loss, I tried testing with BCEWithLogitsLoss() (which had also given pretty good results previously), but the loss isn’t really decreasing and the model is overall stuck at pretty low IoU scores.
So just setting requires_grad on the output, as you mentioned in your first post, will stop the backward pass at that tensor, so it would not update the neural network.
So from your description you would want to find the operation breaking the autograd graph in order to get a meaningful backward pass again.
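A small sketch of why that workaround silences the error without fixing anything (toy linear model and MSE stand-in, not your code; .detach() simulates whatever is breaking your graph):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
x = torch.randn(8, 4)
y = torch.randn(8, 1)

pred = model(x)
loss = (pred - y).pow(2).mean().detach()  # simulate a broken autograd graph
loss.requires_grad = True                 # the workaround from the first post
loss.backward()                           # runs without error...

# ...but backward stops at `loss` itself: no gradient reaches the parameters.
print(model.weight.grad)  # None -- the optimizer has nothing to step with
```

That is exactly the "stuck at the same loss" symptom: training runs, but every parameter's .grad stays None, so the optimizer never moves.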
Hey, I’ve gone over everything but I’m still unsure what could be breaking the autograd graph, considering the exact same code worked perfectly a few weeks ago. I am running all of this on Colab. Would that make any difference? (Could an automatic library version change on Colab be a factor? I really don’t understand what could be causing this, as the exact same code worked well on the same dataset a while ago.)