Training with Half Precision

Could you check if the output of the model already contains invalid values?
If so, could you check the intermediate activation values for any invalid values (e.g. using torch.isfinite(out).all()) to narrow down the first occurrence?
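
A minimal sketch of this check (model and inp are placeholders for your module and input batch):

import torch

out = model(inp)                  # forward pass (with amp/autocast enabled)
print(torch.isfinite(out).all())  # prints tensor(False, ...) as soon as the output contains NaN/Inf values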

What do you mean by “first occurrence”? I use torch.autograd.set_detect_anomaly(True) and the output says there is a NaN problem with SqrtBackward, AddBackward, or CudnnConvolutionBackward, sometimes in the 0th output.
I think this is an AMP problem, since it doesn’t happen when I turn AMP off.
Thank you!

By “first occurrence” I meant the first activation that shows an invalid value, in order to narrow down the operation.
Since you are apparently seeing different operations at the moment, checking the activations in order would help isolate the offending one (e.g. an eps value used in a sqrt might be too small when using amp and could thus underflow).
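
For illustration, an eps value that is perfectly fine in float32 already underflows to zero in float16 (eps=1e-8 is just an example value):

import torch

eps = 1e-8
print(torch.tensor(eps, dtype=torch.float32))  # tensor(1.0000e-08)
print(torch.tensor(eps, dtype=torch.float16))  # tensor(0., dtype=torch.float16) -> eps underflows
# sqrt(x + eps) can then become sqrt(0.), whose backward (0.5 / sqrt(0.)) yields Inf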

That’s much clearer now!
Let’s assume I have many layers that use the sqrt operation. How can I detect which one causes the overflow/underflow problem?
Is there a way to find it without modifying the forward pass of each layer to figure out the first occurrence?
Thank you!

If you are using nn.Modules, you could use forward hooks as described here, which would allow you to check the outputs without changing the forward function.
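
A minimal sketch of such a check (model is a placeholder for your nn.Module); the first printed message would then point to the first module producing non-finite outputs:

import torch

def make_check_hook(name):
    # forward hook: flag modules whose output contains NaN/Inf values
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            print(f'non-finite output in {name} ({module.__class__.__name__})')
    return hook

# register the hook on every submodule without touching any forward() method
for name, module in model.named_modules():
    module.register_forward_hook(make_check_hook(name))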

Thank you so much. I will try it and update with the result!

I think I misunderstood the output log. In the first epoch, the network finishes the forward pass and returns a finite loss value (tensor(0.2221, device='cuda:0', grad_fn=<L1LossBackward>)).

This means there aren’t invalid values in any layer’s activations. I also used torch.isfinite(out).all() to check the activation output, and it returns tensor(True, device='cuda:0').
So the problem is in the backward pass, isn’t it? If so, can I use register_backward_hook to figure out which layer causes the NaN gradient and narrow it down? Are there any trade-offs or consequences?

This is the output log:

tensor(0.2221, device='cuda:0', grad_fn=<L1LossBackward>)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-18-98005322b3ff> in <module>()
     38       loss = criterion(output, target)
     39       print(loss)
---> 40     scaler.scale(loss).backward()
     41     scaler.step(optimizer)
     42     scaler.update()

1 frames
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    147     Variable._execution_engine.run_backward(
    148         tensors, grad_tensors_, retain_graph, create_graph, inputs,
--> 149         allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
    150 
    151 

RuntimeError: Function 'MulBackward0' returned nan values in its 0th output.

Update
I used register_backward_hook to inspect the input/output gradients of the last layer of my network with this code snippet:

def printgradnorm(self, grad_input, grad_output):
    print('Inside ' + self.__class__.__name__ + ' backward')
    print('Inside class:' + self.__class__.__name__)
    print('')
    print('grad_input: ', type(grad_input))
    print('grad_input[0]: ', type(grad_input[0]))
    print('grad_output: ', type(grad_output))
    print('grad_output[0]: ', type(grad_output[0]))
    print('')
    print('grad_input size:', grad_input[0].size())
    print('grad_output size:', grad_output[0].size())
    print('grad_input norm:', grad_input[0].norm())
    print('grad_output norm:', grad_output[0].norm())

    print('')
    print('is finite: ', torch.isfinite(grad_input[0]).all())
    print('is finite: ', torch.isfinite(grad_output[0]).all())

model.convblock10.register_backward_hook(printgradnorm)
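
(Side note: on newer PyTorch versions register_backward_hook is deprecated in favor of register_full_backward_hook, which uses the same (module, grad_input, grad_output) signature, so model.convblock10.register_full_backward_hook(printgradnorm) should work as well if your version supports it.)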

Without using the AMP package, the output is:

loss value:  tensor(0.3254, device='cuda:0', grad_fn=<L1LossBackward>)
Inside Conv2d backward
Inside class:Conv2d

grad_input:  <class 'tuple'>
grad_input[0]:  <class 'torch.Tensor'>
grad_output:  <class 'tuple'>
grad_output[0]:  <class 'torch.Tensor'>

grad_input size: torch.Size([2, 16, 256, 256])
grad_output size: torch.Size([2, 16, 256, 256])
grad_input norm: tensor(0.0007, device='cuda:0')
grad_output norm: tensor(0.0007, device='cuda:0')

is finite:  tensor(True, device='cuda:0')
is finite:  tensor(True, device='cuda:0')

and using AMP gives this output:

loss value:  tensor(0.3358, device='cuda:0', grad_fn=<L1LossBackward>)
Inside Conv2d backward
Inside class:Conv2d

grad_input:  <class 'tuple'>
grad_input[0]:  <class 'torch.Tensor'>
grad_output:  <class 'tuple'>
grad_output[0]:  <class 'torch.Tensor'>

grad_input size: torch.Size([2, 16, 256, 256])
grad_output size: torch.Size([2, 16, 256, 256])
grad_input norm: tensor(45.2500, device='cuda:0', dtype=torch.float16)
grad_output norm: tensor(45.2500, device='cuda:0', dtype=torch.float16)

is finite:  tensor(True, device='cuda:0')
is finite:  tensor(True, device='cuda:0')

and for the next layer in the backward pass, the gradient is NaN (Inf).

How can I narrow down the exploding value in this case?
Once again, thank you so much!

You would have to be a bit careful about when to check for invalid gradients during mixed-precision training.
The important part is that the forward pass is not creating invalid values, as this would point towards an overflow and you should then narrow it down using the aforementioned forward hooks.
However, based on your description it seems that the forward pass does not yield any invalid values, but anomaly detection triggers during the backward pass.
This is expected for the first few iterations when using the GradScaler with the default scale value.
The loss will initially be scaled by init_scale=65536.0, which can overflow the gradients. scaler.step(optimizer) will then check for these invalid gradients, skip the optimizer.step() call, and lower the scale value, so the parameters are never updated with the invalid gradients.
If you want to avoid these initially skipped steps, you could pass a lower init_scale to the GradScaler.
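
A minimal sketch of lowering the initial scale (model, loader, criterion, and optimizer are placeholders; init_scale=1024.0 is just an example value, not a recommendation):

import torch

# the default is init_scale=65536.0; a smaller value makes the first
# iterations less likely to overflow the gradients and skip steps
scaler = torch.cuda.amp.GradScaler(init_scale=1024.0)

for data, target in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        output = model(data)
        loss = criterion(output, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # skips optimizer.step() if invalid gradients are found
    scaler.update()         # lowers the scale after a skipped step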

With that being said, in your first post you mentioned that the loss function “gave NaN” values, which points towards the forward pass.

Why would float16 cause convergence issues for BN?