I have noticed that there are NaNs in the gradients of my model. This is confirmed by torch.autograd.detect_anomaly():

    RuntimeError: Function 'DivBackward0' returned nan values in its 1th output.

I do not know which division causes the problem, since DivBackward0 does not seem to be a unique name. However, I have added asserts to all divisions (like assert torch.all(divisor != 0)) and also have lots of asserts that check for NaNs in general (like assert torch.all(~torch.isnan(t))).
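For reference, this is roughly how I run the check (a minimal sketch; model, x and loss_fn are placeholders for my actual setup):

    import torch

    # Placeholders for my actual model, input and loss.
    # Anomaly detection makes backward() raise on the first NaN gradient
    # and print the traceback of the forward op that created it.
    with torch.autograd.detect_anomaly():
        loss = loss_fn(model(x))
        loss.backward()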
I also iterate the autograd graph and register hooks that print each function and check its gradients for NaNs, using the following code:
    import torch

    def iter_graph(root, callback):
        # Walk the autograd graph starting at `root`, visiting each
        # grad_fn node exactly once.
        queue = [root]
        seen = set()
        while queue:
            fn = queue.pop()
            if fn in seen:
                continue
            seen.add(fn)
            for next_fn, _ in fn.next_functions:
                if next_fn is not None:
                    queue.append(next_fn)
            callback(fn)

    def register_hooks(var):
        fn_dict = {}

        def hook_cb(fn):
            def register_grad(grad_input, grad_output):
                # grad_output: gradients flowing into this node during backward;
                # grad_input: gradients this node produces for its inputs.
                print(fn)
                assert all(t is None or torch.all(~torch.isnan(t)) for t in grad_input), \
                    f"{fn} grad_input={grad_input} grad_output={grad_output}"
                assert all(t is None or torch.all(~torch.isnan(t)) for t in grad_output), \
                    f"{fn} grad_input={grad_input} grad_output={grad_output}"
                fn_dict[fn] = grad_input
            fn.register_hook(register_grad)

        iter_graph(var.grad_fn, hook_cb)
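I call it like this (sketch; model, x and loss_fn are again placeholders):

    output = model(x)       # placeholder forward pass
    loss = loss_fn(output)  # placeholder loss
    register_hooks(loss)    # hooks every node reachable from loss.grad_fn
    loss.backward()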
The output looks like this:
<ViewBackward object at 0x7fb79bae50d0>
<SubBackward0 object at 0x7fb79bae5130>
<DivBackward0 object at 0x7fb79bae51c0>
<DivBackward0 object at 0x7fb79bae5190>
<SliceBackward object at 0x7fb79bae50a0>
<SliceBackward object at 0x7fb79bae5400>
<ViewBackward object at 0x7fb79badcfd0>
...
<SigmoidBackward object at 0x7fb79bacc3d0>
<AddmmBackward object at 0x7fb79bacc430>
<TBackward object at 0x7fb79bacc4c0>
<CudnnBatchNormBackward object at 0x7fb79bacc490>
...
<torch.autograd.function.BilinearInterpolationBackward object at 0x7fb8cc79c4a0>
<torch.autograd.function.BilinearInterpolationBackward object at 0x7fb8cc79c3c0>
<torch.autograd.function.BilinearInterpolationBackward object at 0x7fb8cc79c2e0>
And then it fails with:
AssertionError: <torch.autograd.function.BilinearInterpolationBackward object at 0x7fb8cc79c2e0> grad_input=(tensor([[0.],
[0.],
[0.],
[0.],
[0.],
[nan],
[0.],
[0.],
[0.],
[nan]], device='cuda:0'), tensor([[0.],
[0.],
[0.],
[0.],
[0.],
[nan],
[0.],
[0.],
[0.],
[nan]], device='cuda:0'), None, None, None, None, None, None, None, None, None) grad_output=(tensor([[ 0.0000e+00, -7.8456e+29],
[ 0.0000e+00, 2.4914e+31],
[ 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, -2.4474e+30],
[ 0.0000e+00, 5.9677e+30],
[ nan, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 9.7542e+30],
[ 0.0000e+00, -2.9419e+30],
[ nan, 0.0000e+00]], device='cuda:0'),)
Interestingly, it does not fail immediately after DivBackward0. BilinearInterpolationBackward has NaNs in both grad_input and grad_output, i.e. the NaN already arrives in its incoming gradients, so it merely propagates the problem rather than causing it.
I am pretty lost at this point. What else can I do to track down the NaN gradients?
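One refinement I am considering: make the hook fail only at the node where the NaN first appears, i.e. where grad_output is still clean but grad_input is not (a sketch; this body would replace register_grad inside hook_cb above, with fn coming from the enclosing closure):

    def register_grad(grad_input, grad_output):
        def has_nan(ts):
            return any(t is not None and torch.any(torch.isnan(t)) for t in ts)
        # Flag only the node that *creates* a NaN: its incoming gradients
        # (grad_output) are clean, its outgoing gradients (grad_input) are not.
        if has_nan(grad_input) and not has_nan(grad_output):
            raise RuntimeError(f"NaN originates in {fn}: "
                               f"grad_input={grad_input} grad_output={grad_output}")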
Edit:
- Checking for Inf does not help
- Bigger batch size does not help
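(Concretely, the Inf check was of the form below, mirroring the NaN asserts; t stands for whatever tensor is being checked.)

    assert torch.all(~torch.isinf(t)), f"Inf in {t}"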
Edit 2:
If I disable cuDNN with torch.backends.cudnn.enabled = False, then I get infinity in a MulBackward0 instead. Investigating further.
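This is how I rerun the hook-based check without cuDNN (sketch, same placeholders as above):

    import torch

    # Disable cuDNN and rerun the same hook-based check; the first failing
    # node is now a MulBackward0 reporting Inf instead of NaN.
    torch.backends.cudnn.enabled = False

    output = model(x)
    loss = loss_fn(output)
    register_hooks(loss)
    loss.backward()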