About torch.autograd.set_detect_anomaly(True):


I am training a CNN with cross_entropy loss.
When I wrap the training code in the debugging context manager

“with torch.autograd.set_detect_anomaly(True):”

I get a runtime error like this:

[W python_anomaly_mode.cpp:60] Warning: Error detected in CudnnConvolutionBackward. Traceback of forward call that caused the error
File "/root/anaconda3/envs/gcl/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/root/anaconda3/envs/gcl/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: Function 'CudnnConvolutionBackward' returned nan values in its 0th output.

It looks like the gradients are exploding, which would explain the error. What I find
strange is that when I don’t use “with torch.autograd.set_detect_anomaly(True):”

I do not get any error…

  1. Why does the error only come up when I use “with torch.autograd.set_detect_anomaly(True):”?

  2. I think the final loss value is calculated properly. Does this NaN value occur because I did something wrong during the forward pass, like cutting the computational graph?

Thanks in advance!

  1. set_detect_anomaly(True) explicitly raises an error with a stack trace, which makes it easier to debug which operation created the invalid values. Without this global flag, the invalid values would still be created silently and could break the training (e.g. if you update any parameter to NaN).

  2. NaN values can be created by many operations. A common cause is dividing by zero (or by a very small value close to zero) in the model and forgetting to add a small eps value to the divisor.
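
As a rough illustration of the eps point, here is a minimal sketch on toy tensors. The helper name safe_divide and the default eps=1e-8 are assumptions made up for this example, not something from your model, and adding eps this way only makes sense when the divisor is non-negative.

```python
import torch


def safe_divide(numerator, denominator, eps=1e-8):
    # Hypothetical helper: eps keeps the divisor away from exactly zero.
    # Assumes the denominator is non-negative (e.g. a norm or a count).
    return numerator / (denominator + eps)


num = torch.tensor([0.0, 1.0], requires_grad=True)
den = torch.tensor([0.0, 2.0], requires_grad=True)

# Plain division: the first element is 0 / 0, which is NaN in the forward pass.
unsafe = num / den

# Guarded division: every element stays finite.
safe = safe_divide(num, den)
safe.sum().backward()

print("unsafe:", unsafe)
print("safe:", safe)
print("num.grad:", num.grad)
print("den.grad:", den.grad)
```

The right eps depends on the dtype and on the scale of your values, so treat 1e-8 as a placeholder.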

I understand that “set_detect_anomaly(True)” detects the error and reports it in more detail.

So when I don’t use it, the invalid values are still created, but they do not always raise an error in the forward pass?

set_detect_anomaly won’t change the behavior of your code, just the error reporting. If you are not using it, the invalid values would also be created, but you would just “use” them, i.e. you might update your model parameters to NaNs etc.
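
To make this concrete, here is a small sketch that runs the same toy step with and without anomaly detection. The parameter, loss, and train_step function are invented for the example (they are not your CNN). The loss is built so that the forward value is finite while the backward of sqrt computes 0 / 0.

```python
import torch


def train_step(detect_anomaly):
    # Toy parameter and optimizer, made up for this illustration.
    w = torch.nn.Parameter(torch.tensor([0.0, 1.0]))
    opt = torch.optim.SGD([w], lr=0.1)
    with torch.autograd.set_detect_anomaly(detect_anomaly):
        # Forward value is a finite 0.0 ...
        loss = (torch.sqrt(w) * 0.0).sum()
        # ... but the backward of sqrt divides 0 by 0 at w[0], giving NaN.
        loss.backward()
    # Only reached if backward did not raise.
    opt.step()
    return loss.detach(), w.detach()


# Without anomaly detection: no error, but the NaN gradient is applied.
loss, w = train_step(detect_anomaly=False)
loss_is_finite = bool(torch.isfinite(loss))
param_has_nan = bool(torch.isnan(w).any())
print("loss finite:", loss_is_finite, "| parameter has NaN:", param_has_nan)

# With anomaly detection: the same computation raises in backward.
raised = False
try:
    train_step(detect_anomaly=True)
except RuntimeError as err:
    raised = True
    print("anomaly mode raised:", err)
```

In both runs the NaN gradient is produced. The only difference is whether backward stops with an error or the optimizer goes on to write the NaN into the parameter.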

I understand.
Thank you so much!