FP16 gives NaN loss when using pre-trained model

I tried the new native fp16 support in torch. However, when I continue training my model for my segmentation task, the loss is NaN. With the same script, if I initialize the same model architecture from scratch, it works fine.
I also set Adam's eps to 1e-4, but it made no difference. I get a NaN loss from the first batch when continuing from my trained model.

The model that I am continuing is already quite good at the task. I am guessing it is due to very low loss value but am not sure. How can I fix this?

Python 3.6
PyTorch 1.6
pip installed lightning
CUDA Version: 10.1

Could you please share your training loop?
Also, basic autocast usage would look something like this:

with autocast():
    out = model(x)
    loss = criterion(out, targets)  # PyTorch losses take (input, target)
print(loss)  # to debug

Sure, here it is. As mentioned, my model works when I train a new model from scratch. It is only when continuing from my good model that I get the NaNs.

from torch.cuda import amp
def mini_trainfp16(model,opt,scheduler,epochs,dataloaders,image_tracker=None,saver=None,saver2=None):
    print('------------------------------------------ Using fp16')
    scaler = amp.GradScaler()
    for epoch in range(epochs):
        for e in dataloaders.keys():
            model.train() if e=='train' else model.eval()
            curr_loader = tqdm(dataloaders[e], total=int(len(dataloaders[e])))
            for idx,data in enumerate(curr_loader):
                x = data['image'].float().to(device)
                labels_v = data['label'].double().to(device)
                with amp.autocast():
                    out = model(x)
                    loss = muti_bce_loss_fusionfp16(out, labels_v) 
                if e=='train':

You're scaling the loss again after the autocast region, which I believe is redundant and might lead to a wrong loss.
Remove the scaler.scale(loss) call and simply run loss.backward() outside the autocast block; that might fix it.
A lot of training can lead to overfitting and a negligible loss, but I don't think it should result in NaN losses.

This would point towards an overflow in some intermediate activations. Or are you also seeing NaN outputs if you train the model from scratch?
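
For reference, here is a minimal sketch of how such an overflow turns into NaN. The values are made up for illustration and are not taken from the model in question. float16 tops out at 65504, so a larger activation becomes inf once it is cast, and arithmetic such as inf - inf then gives NaN.

```python
import torch

# Largest finite value representable in float16.
fp16_max = torch.finfo(torch.float16).max  # 65504.0

# A hypothetical activation that is fine in float32 ...
act = torch.tensor([70000.0, 1.0])
# ... but overflows to inf once it is cast to float16.
act_half = act.half()

# Arithmetic on inf easily produces NaN, e.g. inf - inf.
# (Cast back to float32 for the subtraction so this also runs on CPU builds
# with limited float16 support; inf stays inf after the cast.)
diff = act_half.float() - act_half.float()

print(fp16_max, act_half, diff)
```

A well-trained model can produce much larger activations or logits than a freshly initialized one, which would be consistent with NaNs showing up only when resuming, though that is a guess until the offending layer is found.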

No, the scaler is an essential part of mixed-precision training and should not be removed.
Here you can see the typical training loop with comments for each step.
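
Roughly, the loop looks like the sketch below. It follows the pattern from the torch.cuda.amp examples, but the tiny model, the random data, and the hyperparameters are placeholders, and AMP is only enabled when CUDA is available so the snippet also runs on CPU.

```python
import torch
import torch.nn as nn
from torch.cuda import amp

use_amp = torch.cuda.is_available()
device = torch.device("cuda" if use_amp else "cpu")

# Placeholder model and data, only here to make the loop runnable.
torch.manual_seed(0)
model = nn.Linear(8, 1).to(device)
criterion = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = amp.GradScaler(enabled=use_amp)

x = torch.randn(16, 8, device=device)
y = torch.randint(0, 2, (16, 1), device=device).float()

losses = []
for step in range(5):
    opt.zero_grad()
    # Forward pass and loss computation run under autocast.
    with amp.autocast(enabled=use_amp):
        out = model(x)
        loss = criterion(out, y)
    # Scale the loss before backward so small gradients do not underflow in fp16.
    scaler.scale(loss).backward()
    # step() unscales the gradients and skips the update if they contain inf/NaN.
    scaler.step(opt)
    # Adjust the scale factor for the next iteration.
    scaler.update()
    losses.append(loss.item())

print(losses)
```

Note that backward() and the scaler calls sit outside the autocast block; only the forward pass and the loss are inside it.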

Oh, I completely missed that. I have been training mixed-precision models wrong all along. Thank you @ptrblck for the correction, and sorry for posting wrong information.

Happy to help! :)
Which models were you training and was the mixed-precision training successful without the gradient scaling?

Mainly object detection and segmentation, specifically fasterrcnn_resnet50_fpn from the torchvision models. The results were not that bad, but yes, I believe the loss did start becoming NaN later in training.

Yes you are right, I am doing deep supervision. I have 5 outputs and some of the outputs are -inf, NaNs and some real values. The final loss is a sum of all losses so that too goes to NaN. How can I fix this? This only happens when continuing the training of my good model.

I am surprised that not all of them are NaN. There are some valid outputs. I suspected it was due to a batchnorm layer having too low a std, but I am unsure now. Any suggestions on how to debug this?

You could register forward hooks for each module and check the output. Once you have isolated the first layer creating the invalid values, you could check its input as well as its parameters to narrow it down further.
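
A minimal sketch of that idea, assuming a toy model whose last layer is deliberately broken; the helper name add_nonfinite_hooks and the model are made up for illustration. Each leaf module gets a forward hook that records the module name when its output contains inf or NaN, so the first recorded name is the first layer that goes wrong.

```python
import torch
import torch.nn as nn

def add_nonfinite_hooks(model, bad_layers):
    """Register a forward hook on every leaf module that appends the module
    name to bad_layers when its output contains inf or NaN."""
    handles = []
    for name, module in model.named_modules():
        if len(list(module.children())) > 0:
            continue  # skip containers, only check leaf modules

        def hook(mod, inputs, output, name=name):
            # Only plain tensor outputs are checked in this sketch.
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                bad_layers.append(name)

        handles.append(module.register_forward_hook(hook))
    return handles

# Toy model (hypothetical): the first layer is set to known finite values,
# the last layer gets inf weights to trigger the check.
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
with torch.no_grad():
    model[0].weight.fill_(0.5)
    model[0].bias.fill_(1.0)
    model[2].weight.fill_(float("inf"))

bad_layers = []
handles = add_nonfinite_hooks(model, bad_layers)
model(torch.ones(1, 4))
for h in handles:
    h.remove()  # remove the hooks once debugging is done

print(bad_layers[0] if bad_layers else "all outputs finite")
```

For the real model you would run one batch inside the autocast block with the hooks attached, then inspect the input and parameters of the first layer that is reported.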