RuntimeError: trying to differentiate twice a function that was marked with @once_differentiable

I am not sure why one works and the other (backward()) doesn't. From the error it is not clear which function is only differentiable once.

The reason I cannot use torch.autograd.grad() for my problem is that I need to accumulate gradients for updating my model parameters, and this is not directly possible with grad(), as the only_inputs flag is deprecated in v0.4.
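For what it's worth, a minimal sketch of how accumulation can still be done with grad() (the model, data, and names here are illustrative, not from the original code): grad() returns the gradients instead of writing them into .grad, so the accumulation step just has to be done by hand.

```python
import torch

# Illustrative model and data (assumptions, not the poster's actual setup).
model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

loss = torch.nn.functional.mse_loss(model(x), y)
grads = torch.autograd.grad(loss, list(model.parameters()))

# Manually accumulate into .grad, mimicking what backward() does.
for p, g in zip(model.parameters(), grads):
    if p.grad is None:
        p.grad = g.detach().clone()
    else:
        p.grad += g
```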

When you backprop through a function of the gradients, you are actually asking for higher-order derivatives, because you need to backprop through the backward pass itself.
Unfortunately, not all Functions support that.
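To make "backprop through the backward pass" concrete, here is a tiny standalone example (not the poster's code): create_graph=True records the backward pass itself, so the resulting gradient can be differentiated again, and every Function on the path must support double backward for this to work.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3  # dy/dx = 3x^2, d2y/dx2 = 6x

# create_graph=True makes the returned gradient differentiable.
(g,) = torch.autograd.grad(y, x, create_graph=True)

# Backprop *through* the backward pass: x.grad ends up holding d2y/dx2.
g.backward()
```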

You can enable anomaly detection mode to find out which forward function caused this issue.
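A quick sketch of how that is turned on (with a trivially differentiable graph here, so this run succeeds; in the poster's case the failing backward would additionally print the traceback of the forward call that created the offending op):

```python
import torch

# Under anomaly detection, a failing backward also reports the forward
# traceback of the operation that produced the error.
with torch.autograd.set_detect_anomaly(True):
    x = torch.randn(3, requires_grad=True)
    loss = (x * 2).sum()
    loss.backward()  # fine here; would print the forward traceback on failure
```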

My concern is: if there is a function that doesn't support second-order differentiation, then torch.autograd.grad() should also throw an error. But that doesn't seem to be the case (grad() runs fine while backward() doesn't). I am trying to understand what's causing this.
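This asymmetry can be reproduced with a hypothetical minimal Function (the class below is illustrative, not the one in the actual model): @once_differentiable does not make the first grad() call fail; it attaches an error node to the gradient it returns, which only fires when that gradient is itself backpropagated.

```python
import torch
from torch.autograd.function import once_differentiable

# Hypothetical Function whose backward is marked @once_differentiable.
class Double(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp):
        return inp * 2

    @staticmethod
    @once_differentiable  # backward runs under no_grad; not differentiable
    def backward(ctx, grad_out):
        return grad_out * 2

x = torch.tensor(1.0, requires_grad=True)
z = Double.apply(x) ** 2

# First derivative: succeeds even with create_graph=True.
(g,) = torch.autograd.grad(z, x, create_graph=True)

# Second derivative: backprops through Double.backward itself and raises.
try:
    g.backward()
    msg = None
except RuntimeError as e:
    msg = str(e)
```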

Why do you use create_graph=True and retain_graph=True in the second call to autograd.grad()?
If you don't do the second autograd.grad(), does the backward work?
I can't really run your code since I don't know what all the inputs/tensors are, so I'm not sure what's wrong here.

Why do you use create_graph=True and retain_graph=True in the second call to autograd.grad()?

retain_graph defaults to the value of create_graph, if that is provided. retain_graph=True is required if we need to compute higher-order gradients. Hence I have retain_graph=True in the first call.
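A small standalone illustration of that default (not the original code): passing create_graph=True to the first call implies retain_graph=True, so the graph survives and the returned gradient can be differentiated by a second call.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2

# First call: create_graph=True implies retain_graph=True, so the graph
# is kept alive and g1 is itself differentiable.
(g1,) = torch.autograd.grad(y, x, create_graph=True)

# Second call: differentiates the first gradient; create_graph is not
# needed here unless a third-order derivative is wanted.
(g2,) = torch.autograd.grad(g1, x)
```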

If you don’t do the second autograd.grad(), does the backward work?

No. It still fails.

Another interesting observation: when I set requires_grad=False on all the parameters in model.parameters() and then call loss.backward(retain_graph=True), I still get the same (once_differentiable) error. This is surprising because there are no variables for which .grad needs to be computed, yet it still throws the same error. I was under the impression that a backward() call only computes gradients w.r.t. parameters with requires_grad=True. Am I missing something here?