Prevent loss divergence during training while using AMP

Hello all,

I recently added AMP support to my code for training a segmentation model with a ResNet-50 backbone.
Everything seems to work well at the beginning, but the loss diverges after some iterations.

The orange curve corresponds to the training loss without mixed precision, the grey one uses mixed precision with the default parameters, and the purple one has init_scale set to 8192.

I am not sure which parameters I should tweak to get this to work correctly, and I wasn't able to find an explanation of the effect of the GradScaler parameters online.

What could cause this divergence, and how can I prevent it?
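For reference, my loop follows the standard AMP pattern; this is a minimal sketch (the model, optimizer, and data here are placeholders for my actual setup, and it falls back to plain FP32 when no GPU is present):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

# Placeholders standing in for the real segmentation model and optimizer.
model = torch.nn.Linear(10, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# The GradScaler knobs I am unsure about: init_scale, growth_factor,
# backoff_factor, growth_interval.
scaler = torch.cuda.amp.GradScaler(init_scale=8192.0, enabled=use_amp)

for _ in range(5):
    inputs = torch.randn(4, 10, device=device)
    targets = torch.randint(0, 2, (4,), device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
    scaler.update()                # grows or backs off the scale factor
```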


Mixed-precision training should not cause any divergence and should reach an accuracy similar to FP32 training.
Could you describe your use case a bit more and what the curves represent?
It's especially interesting since ResNet-50 has been trained in mixed precision for years without such failures.
Also, which GPU and dtype (float16 or bfloat16) are you using?


I am training a light-weight RefineNet with a ResNet-50 backbone for simultaneous segmentation (20 channels/classes) and regression (one additional channel).

The training objective consists of two losses: a focal loss for segmentation and a scale-invariant loss for regression of a distance map.

Training without mixed precision works very well, but adding mixed precision makes it diverge.

I tracked the evolution of the scale factor during training, and it seems that it just keeps growing, which causes the problem in my case.

In the following run I set init_scale=500 and growth_interval=10000, which behaves better but still suffers from the same divergence problem.
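Concretely, I construct the scaler like this and log its scale during training (a sketch; the logging code is commented out because it lives inside my training loop):

```python
import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=500.0,       # start much lower than the default of 65536
    growth_interval=10000,  # wait more iterations between scale increases
)

# Inside the training loop, after scaler.update(), I track the scale:
# if iteration % 100 == 0:
#     print(f"iter {iteration}: scale = {scaler.get_scale()}")
```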

(The l1 loss is not part of the training objective.)

Also, I am using torch 1.9.1 with CUDA 11.0 and a Titan RTX GPU for this task.

A large scaling factor is beneficial as long as no overflows occur, since it avoids underflows in the gradient calculation. Once the scaling factor has grown large enough that it does cause overflows, the optimizer.step() call will be skipped inside scaler.step() and the scale factor will be decreased again, so the parameters won't be updated in that iteration.
I would recommend updating PyTorch to the latest release and checking if you are still seeing the issue.
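You can observe this behavior yourself; here is a minimal sketch (with a toy model, not your setup) that counts skipped steps by comparing the scale before and after scaler.update() — the scale only shrinks when inf/nan gradients caused the step to be skipped:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = torch.nn.Linear(8, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

skipped = 0
for step in range(20):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = model(torch.randn(4, 8, device=device)).mean()
    scaler.scale(loss).backward()
    scale_before = scaler.get_scale()
    scaler.step(optimizer)
    scaler.update()
    # A decreased scale means scaler.step() found inf/nan gradients
    # and skipped optimizer.step() for this iteration.
    if scaler.get_scale() < scale_before:
        skipped += 1
print(f"skipped optimizer steps: {skipped}")
```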

Thank you for the reply.

I found that the code base I was working with had implemented a nan_to_num operation that prevented the GradScaler from working correctly.
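For anyone hitting the same issue, here is a sketch of the failure mode, assuming the nan_to_num was applied to values the scaler later inspects (e.g. gradients): GradScaler detects overflow by looking for inf/nan, and nan_to_num replaces exactly those values with finite (but wrong) numbers, so the scale never backs off and keeps growing:

```python
import torch

# A gradient tensor after an fp16 overflow: contains inf/nan.
grad = torch.tensor([1.0, float("inf"), float("nan")])

# This is the signal GradScaler relies on to back off the scale.
assert not torch.isfinite(grad).all()

# A nan_to_num pass hides the overflow: inf becomes the largest finite
# float and nan becomes 0, so the scaler sees only finite gradients
# and scaler.update() keeps growing the scale unchecked.
sanitized = torch.nan_to_num(grad)
assert torch.isfinite(sanitized).all()
```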

The code works fine now.