Mixed-precision training leads to NaN loss

Hello,

to save memory I am trying to train a resnet18 model in half precision, so I converted my inputs and my model. I also changed the BatchNorm2d layers back to float32 with the following lines:

import torch.nn as nn

resnet18.half()
# keep the BatchNorm layers in float32, since their statistics are prone to overflow in half precision
for layer in resnet18.modules():
    if isinstance(layer, nn.BatchNorm2d):
        layer.float()

Nevertheless, the loss becomes NaN after the second batch. Since everything works fine without half precision, and keeping BatchNorm2d in float32 should reduce the chance of instabilities, I was wondering if I am missing something or if there is another way?
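
For context, the training step looks roughly like this (just a sketch; train_loader, criterion, and optimizer are placeholders, not my exact code):

for inputs, targets in train_loader:
    inputs = inputs.cuda().half()     # cast the inputs to match the half-precision model
    targets = targets.cuda()
    optimizer.zero_grad()
    outputs = resnet18(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()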

Calling model.half() might work, but can easily create overflows and thus NaNs.
We recommend using automatic mixed-precision training (torch.cuda.amp), which takes care of these issues and also stabilizes the training using loss scaling.
You can use it in the nightly binaries or by building PyTorch from source.
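
As a rough sketch of what that looks like with the torch.cuda.amp utilities (model, criterion, optimizer, and train_loader are placeholders):

import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in train_loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    # run the forward pass in mixed precision where it is numerically safe
    with torch.cuda.amp.autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    # scale the loss to avoid gradient underflow, then step via the scaler
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

Note that with autocast the model parameters stay in float32, so you would no longer call model.half() at all.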
