About torch.autograd.set_detect_anomaly(True):

Hello.

I am training a CNN with a cross_entropy loss. When I wrap the training step in the debugging tool

“with torch.autograd.set_detect_anomaly(True):”

I get a runtime error like this:

[W python_anomaly_mode.cpp:60] Warning: Error detected in CudnnConvolutionBackward. Traceback of forward call that caused the error
  self.scaler.scale(self.losses).backward()
  File "/root/anaconda3/envs/gcl/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/root/anaconda3/envs/gcl/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'CudnnConvolutionBackward' returned nan values in its 0th output.
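For context, my training step looks roughly like this (a minimal sketch with placeholder names; the tiny model, optimizer, and random data here are stand-ins, and the self.scaler in the traceback is torch.cuda.amp.GradScaler, i.e. I am using mixed precision):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in CNN and data; my real model/loader are larger.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()  # corresponds to self.scaler in the traceback

with torch.autograd.set_detect_anomaly(True):
    images = torch.randn(4, 3, 32, 32, device="cuda")
    targets = torch.randint(0, 10, (4,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = F.cross_entropy(model(images), targets)
    scaler.scale(loss).backward()  # the RuntimeError is raised here under anomaly mode
    scaler.step(optimizer)
    scaler.update()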

It looks like the gradients are exploding, which would explain the error, but what I find
weird is that when I don’t use “with torch.autograd.set_detect_anomaly(True):”

I do not get any error at all.

  1. Why does the error only come up when I use “with torch.autograd.set_detect_anomaly(True):”?

  2. I think the final loss value is calculated properly. Does this NaN occur because I did something wrong during the forward pass, like cutting the computational graph? (I show how I am checking the loss and gradients below.)
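For question 2, this is how I am sanity-checking things when anomaly mode is off (a debugging snippet continuing the sketch above with the same placeholder names; since GradScaler is in use, the raw gradients are scaled, so I unscale them before inspecting):

# Continuing the sketch above: check that the loss is finite, then look
# for NaN/Inf gradients after backward. Under AMP the gradients carry the
# loss scale, so unscale_ first to see them in real units.
print("loss finite:", torch.isfinite(loss).item())

scaler.scale(loss).backward()
scaler.unscale_(optimizer)
for name, p in model.named_parameters():
    if p.grad is not None and not torch.isfinite(p.grad).all():
        print("non-finite grad in", name)

With this check the loss always prints as finite, which is why I believe the forward pass is producing a reasonable value.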

Thanks in advance!