I think I misunderstood the output log. At the first epoch, the network finishes the forward pass and gives a finite loss value: tensor(0.2221, device='cuda:0', grad_fn=<L1LossBackward>).
This means there are no invalid inputs at any layer. I also use torch.isfinite(out).all() to check the activation output, and it returns tensor(True, device='cuda:0').
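For reference, a per-layer version of that check via a forward hook would look roughly like this (a sketch; check_forward is just a helper name I made up):

import torch

def check_forward(module, inp, out):
    # Flag any module whose forward output contains NaN/Inf
    if isinstance(out, torch.Tensor) and not torch.isfinite(out).all():
        print('Non-finite activation in', module.__class__.__name__)

for name, module in model.named_modules():
    module.register_forward_hook(check_forward)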
So the problem is in the backward pass, isn't it? If so, can I use register_backward_hook to figure out which layer causes the NaN gradient and narrow it down? Are there any trade-offs or consequences?
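Something like the following is what I have in mind: register the hook on every submodule so the first non-finite gradient gets reported (a rough sketch; nan_hook is a name I made up):

import torch

def nan_hook(module, grad_input, grad_output):
    # grad_output: gradients flowing into the module during backward;
    # grad_input: gradients it produces w.r.t. its inputs
    for g in list(grad_output) + list(grad_input):
        if g is not None and not torch.isfinite(g).all():
            print('Non-finite gradient at', module.__class__.__name__)
            break

for name, module in model.named_modules():
    module.register_backward_hook(nan_hook)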
This is the output log:
tensor(0.2221, device='cuda:0', grad_fn=<L1LossBackward>)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-18-98005322b3ff> in <module>()
38 loss = criterion(output, target)
39 print(loss)
---> 40 scaler.scale(loss).backward()
41 scaler.step(optimizer)
42 scaler.update()
1 frames
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
147 Variable._execution_engine.run_backward(
148 tensors, grad_tensors_, retain_graph, create_graph, inputs,
--> 149 allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
150
151
RuntimeError: Function 'MulBackward0' returned nan values in its 0th output.
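If I understand correctly, this RuntimeError comes from autograd's anomaly mode, which I have enabled roughly like this (the variable names mirror my training loop from the traceback); when it fires, it should also print a warning with the traceback of the forward call that created the failing op:

import torch

# Anomaly mode records, for each backward function, the forward call
# that created it; it slows training down, so it is for debugging only.
torch.autograd.set_detect_anomaly(True)

output = model(data)
loss = criterion(output, target)
scaler.scale(loss).backward()  # raises and points at the offending forward op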
Update
I used register_backward_hook to hook the input/output gradients of the last layer of my network with this code snippet:
def printgradnorm(self, grad_input, grad_output):
    print('Inside ' + self.__class__.__name__ + ' backward')
    print('Inside class:' + self.__class__.__name__)
    print('')
    print('grad_input: ', type(grad_input))
    print('grad_input[0]: ', type(grad_input[0]))
    print('grad_output: ', type(grad_output))
    print('grad_output[0]: ', type(grad_output[0]))
    print('')
    print('grad_input size:', grad_input[0].size())
    print('grad_output size:', grad_output[0].size())
    print('grad_input norm:', grad_input[0].norm())
    print('grad_output norm:', grad_output[0].norm())
    print('')
    print('is finite:', torch.isfinite(grad_input[0]).all())
    print('is finite:', torch.isfinite(grad_output[0]).all())

model.convblock10.register_backward_hook(printgradnorm)
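(A side note: I read that newer PyTorch releases deprecate register_backward_hook in favor of register_full_backward_hook, which has the same hook signature but computes grad_input correctly for modules with multiple inputs; on a recent version the last line would become:

model.convblock10.register_full_backward_hook(printgradnorm)

I am on an older version, so I kept the deprecated call.)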
Without the AMP package, the output is:
loss value: tensor(0.3254, device='cuda:0', grad_fn=<L1LossBackward>)
Inside Conv2d backward
Inside class:Conv2d
grad_input: <class 'tuple'>
grad_input[0]: <class 'torch.Tensor'>
grad_output: <class 'tuple'>
grad_output[0]: <class 'torch.Tensor'>
grad_input size: torch.Size([2, 16, 256, 256])
grad_output size: torch.Size([2, 16, 256, 256])
grad_input norm: tensor(0.0007, device='cuda:0')
grad_output norm: tensor(0.0007, device='cuda:0')
is finite: tensor(True, device='cuda:0')
is finite: tensor(True, device='cuda:0')
and using AMP gives this output:
loss value: tensor(0.3358, device='cuda:0', grad_fn=<L1LossBackward>)
Inside Conv2d backward
Inside class:Conv2d
grad_input: <class 'tuple'>
grad_input[0]: <class 'torch.Tensor'>
grad_output: <class 'tuple'>
grad_output[0]: <class 'torch.Tensor'>
grad_input size: torch.Size([2, 16, 256, 256])
grad_output size: torch.Size([2, 16, 256, 256])
grad_input norm: tensor(45.2500, device='cuda:0', dtype=torch.float16)
grad_output norm: tensor(45.2500, device='cuda:0', dtype=torch.float16)
is finite: tensor(True, device='cuda:0')
is finite: tensor(True, device='cuda:0')
At the next layer in the backward pass, however, the gradient becomes non-finite (NaN/Inf). How can I narrow down the exploding value in this case?
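One idea I am considering, sketched below: unscale the gradients right after backward (the AMP docs show scaler.unscale_ used this way for gradient clipping) and then report every parameter whose gradient is non-finite:

scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # grads are now comparable to the no-AMP run

# Report each parameter with a NaN/Inf gradient
for name, p in model.named_parameters():
    if p.grad is not None and not torch.isfinite(p.grad).all():
        print('non-finite grad in', name)

scaler.step(optimizer)  # skips the update when non-finite grads are present
scaler.update()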
Once again, thank you so much!