'CudnnConvolutionBackward' returned nan values in its 0th output


I am training an object detection model that has two losses. One of them tended to infinity, but normalizing with the code below fixed that:

pi_minvals = pi[..., 4].min(3, keepdim=True).values
pj_minvals = pi[..., 4].min(3, keepdim=True).values
pi_maxvals = pi[..., 4].max(3, keepdim=True).values
pj_maxvals = pi[..., 4].max(3, keepdim=True).values
pi_norm = (pi[..., 4] - pi_minvals) / (pi_maxvals - pi_minvals)
pj_norm = (pj[..., 4] - pj_minvals) / (pj_maxvals - pj_minvals)
obji = self.BCEobj(pi_norm, pj_norm)

But during the first epoch I still get this error:

  File "/home/hoda/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/hoda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: Function 'CudnnConvolutionBackward' returned nan values in its 0th output.

Do you have any idea how to solve it?


You can use torch.autograd.set_detect_anomaly(True) and follow the examples in the documentation.

This should help you find which operation is creating the NaN.
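As a rough illustration of what anomaly detection reports, here is a minimal sketch on a toy tensor rather than the detection model from the question. The sqrt-times-zero expression is only an assumed stand-in that is known to produce a NaN gradient; the real culprit in a training loop will be some other operation.

```python
import torch

# Toy stand-in for a real model: sqrt has an infinite derivative at 0, and
# combining that with a zero upstream gradient gives 0/0 = NaN in backward.
x = torch.zeros(1, requires_grad=True)

caught = None
with torch.autograd.detect_anomaly():
    y = torch.sqrt(x) * 0.0  # the forward value is a perfectly finite 0.0
    try:
        y.sum().backward()
    except RuntimeError as err:
        # The message names the backward function that produced the NaN, and
        # the warning printed alongside it shows the forward-pass traceback.
        caught = str(err)

print(caught)
```

In a training script the same context manager would wrap the forward pass and the loss.backward() call. Anomaly detection slows training noticeably, so it is usually switched on only while debugging.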

Thank you for your response. The message detect_anomaly gives is:

one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [75, 512, 1, 1]] is at version 3; expected version 2 instead.
Hint: the backtrace further above shows the operation that failed to compute its gradient. 
The variable in question was changed in there or anywhere later.

Do you have any idea how I can find that variable?

Hi, I never got notified of your response, so apologies for the delay.

So somewhere in your code you're using inplace=True; make sure to set that to False. If I had to guess from the shape of the tensor, you have a Conv2d layer somewhere in your model and you pass its output through a ReLU, perhaps with ReLU(inplace=True)?
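To make the failure mode concrete, here is a small sketch that reproduces the same class of error on a toy tensor. The function name run and the sigmoid/mul_ pair are illustrative assumptions, not taken from the model in question; sigmoid is used only because it saves its output for the backward pass.

```python
import torch

def run(inplace: bool) -> torch.Tensor:
    """Hypothetical helper: same computation, with and without an in-place op."""
    w = torch.ones(3, requires_grad=True)
    h = torch.sigmoid(w)      # autograd keeps h to compute sigmoid's gradient
    if inplace:
        h.mul_(2)             # overwrites the tensor autograd saved
        out = h
    else:
        out = h * 2           # allocates a new tensor, h stays intact
    out.sum().backward()
    return w.grad

print(run(inplace=False))     # backward succeeds

try:
    run(inplace=True)
except RuntimeError as err:
    message = str(err)
    print(message)            # "... has been modified by an inplace operation ..."
```

The out-of-place version costs one extra tensor allocation but leaves the saved value untouched, which is why switching off in-place operations usually makes this error go away.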

Go through your code and set all inplace=True to inplace=False. If you could share your model code too, that'd help find the error as well!
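If the model has many layers, flipping the flag by hand can be tedious. The sketch below walks the module tree and switches off every inplace attribute it finds. The helper name disable_inplace and the two-layer example model are assumptions for illustration, and this only covers modules that expose an inplace attribute; tensor-level operations such as x += y or x.relu_() in a forward method still have to be found by reading the code.

```python
import torch.nn as nn

def disable_inplace(model: nn.Module) -> int:
    """Hypothetical helper: set inplace=False on every submodule that has it.

    Returns the number of modules that were changed.
    """
    changed = 0
    for module in model.modules():
        if getattr(module, "inplace", False):
            module.inplace = False
            changed += 1
    return changed

# Assumed example model: a 1x1 convolution followed by an in-place ReLU.
model = nn.Sequential(
    nn.Conv2d(512, 75, kernel_size=1),
    nn.ReLU(inplace=True),
)
print(disable_inplace(model))
```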