Hello.
I am training a CNN with cross_entropy loss.
When I train the network with the debugging tool enabled, i.e. with the training step wrapped in
"with torch.autograd.set_detect_anomaly(True):",
I get a runtime error like this:
[W python_anomaly_mode.cpp:60] Warning: Error detected in CudnnConvolutionBackward. Traceback of forward call that caused the error:
    self.scaler.scale(self.losses).backward()
  File "/root/anaconda3/envs/gcl/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/root/anaconda3/envs/gcl/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'CudnnConvolutionBackward' returned nan values in its 0th output.
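For context, here is a minimal sketch of what my training step looks like. All names (model, optimizer, criterion, the toy CNN, the fake batch) are placeholders, and I am assuming self.scaler is a torch.cuda.amp.GradScaler based on the scaler.scale(loss).backward() call in the traceback; this mirrors the structure of my loop, not the NaN itself:

```python
import torch

# Placeholder CNN + fake data; my real model and data are different (needs a GPU).
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Flatten(),
    torch.nn.Linear(8 * 32 * 32, 10),
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()      # assumption: self.scaler is a GradScaler
criterion = torch.nn.CrossEntropyLoss()   # cross-entropy loss

images = torch.randn(4, 3, 32, 32, device="cuda")
targets = torch.randint(0, 10, (4,), device="cuda")

with torch.autograd.set_detect_anomaly(True):
    with torch.cuda.amp.autocast():       # mixed-precision forward pass
        losses = criterion(model(images), targets)
    scaler.scale(losses).backward()       # the line from my traceback
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
```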
From this I suspect the gradients are becoming NaN (exploding?), which triggers the error. But what I find
weird is that when I don't use "with torch.autograd.set_detect_anomaly(True):",
I do not get any error at all...
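To verify whether any gradient actually contains NaN/Inf even without anomaly detection, I can inspect the gradients directly after backward() (a sketch reusing the placeholder names from above; note the gradients are still scaled by the GradScaler at this point):

```python
# Recompute the forward pass, then check the (still scaled) gradients:
scaler.scale(losses).backward()
for name, param in model.named_parameters():
    if param.grad is not None and not torch.isfinite(param.grad).all():
        print(f"non-finite gradient in {name}")
```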
- Why does the error only come up when I use "with torch.autograd.set_detect_anomaly(True):"?
- The final loss value seems to be calculated properly, I think. Could this NaN occur because I did something wrong during the forward pass, like cutting the computational graph (see the sketch below)?
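By "cutting the computational graph" I mean a hypothetical mistake like calling .detach() (or using .data) on an intermediate tensor, so that gradients stop flowing to everything before the cut:

```python
import torch

x = torch.randn(4, 3, requires_grad=True)
w = torch.randn(3, 3, requires_grad=True)
h = (x @ w).detach()        # the graph is cut here: h has no grad history
loss = (h * w).sum()        # loss still requires grad via this second use of w
loss.backward()
print(x.grad)               # None: no gradient reaches x past the detach
print(w.grad is not None)   # True: only the path after the cut receives grads
```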
Thanks in advance!