FP16 gives NaN loss when using pre-trained model

I tried the new native fp16 (AMP) support in PyTorch. However, when I continue training my model for my segmentation task, I get NaN losses. With the same script, if I initialize the same model architecture from scratch, training works fine.
I set Adam's eps to 1e-4 as well, but it made no difference. I get a NaN loss from the very first batch when continuing from my trained model.
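For reference, my optimizer is created roughly like this (the learning rate is just a placeholder here; the only change I made is eps):

import torch

# placeholder lr; the relevant change is eps=1e-4 instead of the default 1e-8
opt = torch.optim.Adam(model.parameters(), lr=1e-4, eps=1e-4)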

The model I am continuing from is already quite good at the task. I am guessing the problem is due to the very low loss value, but I am not sure. How can I fix this?

Python 3.6
PyTorch 1.6
Linux
Lightning installed via pip
CUDA version: 10.1
NVIDIA V100

Could you please share your training loop?
Also, basic autocast usage would look something like this:

from torch.cuda.amp import autocast

optim.zero_grad()
with autocast():
    out = model(x)
    loss = criterion(out, targets)
print(loss)  # to debug
loss.backward()
optim.step()

Sure, here it is. As mentioned, training works when I train a new model from scratch; it is only when continuing my good model that I get the NaNs.

import torch
from torch.cuda import amp
from tqdm import tqdm

# `device` and `muti_bce_loss_fusionfp16` are defined elsewhere in the script
def mini_trainfp16(model, opt, scheduler, epochs, dataloaders, image_tracker=None, saver=None, saver2=None):
    print('------------------------------------------ Using fp16')
    scaler = amp.GradScaler()
    for epoch in range(epochs):
        for e in dataloaders.keys():  # e is the phase name, e.g. 'train'
            model.train() if e == 'train' else model.eval()
            curr_loader = tqdm(dataloaders[e], total=int(len(dataloaders[e])))
            for idx, data in enumerate(curr_loader):
                x = data['image'].float().to(device)
                labels_v = data['label'].double().to(device)
                # forward pass and loss computation in mixed precision
                with amp.autocast():
                    out = model(x)
                    loss = muti_bce_loss_fusionfp16(out, labels_v)

                if e == 'train':
                    scaler.scale(loss).backward()
                    scaler.step(opt)
                    scaler.update()
                    opt.zero_grad()
                    scheduler.step()

You're scaling the loss again after the autocast region, which is redundant and might lead to a wrong loss, I believe.
Remove the scaler.scale(loss) and simply call loss.backward() outside the autocast block; that might fix it…
A lot of training will lead to overfitting and a negligible loss, but I don't think it should result in NaN losses.

This would point towards an overflow in some intermediate activations. Or are you also seeing NaN outputs if you train the model from scratch?

No, the scaler is an essential part of mixed-precision training and should not be removed.
Here you can see the typical training loop with comments for each step.
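Roughly, the pattern looks like this (the names are placeholders, not your actual objects):

import torch
from torch.cuda import amp

scaler = amp.GradScaler()
for data, target in loader:
    opt.zero_grad()
    # run the forward pass and the loss computation in mixed precision
    with amp.autocast():
        output = model(data)
        loss = criterion(output, target)
    # scale the loss and backpropagate scaled gradients
    scaler.scale(loss).backward()
    # unscale the gradients and skip opt.step() if they contain infs/NaNs
    scaler.step(opt)
    # adjust the scale factor for the next iteration
    scaler.update()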


Oh, I completely missed that. I have been training mixed-precision models wrong all along. Thank you @ptrblck for correcting me, and sorry for posting wrong information…

Happy to help! 🙂
Which models were you training and was the mixed-precision training successful without the gradient scaling?

Mainly object detection and segmentation, specifically fasterrcnn_resnet50_fpn from the torchvision models. The results were not that bad either, but yes, I believe the loss did start becoming NaN later in the training.

Yes, you are right, I am doing deep supervision. I have 5 outputs, and some of them contain -inf and NaNs while others contain real values. The final loss is the sum of all the per-output losses, so it also goes to NaN. How can I fix this? It only happens when continuing the training of my good model.
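Something like this could be used to see which of the outputs are affected (out here stands for the tuple of the 5 outputs):

import torch

# flag every deep-supervision output that contains non-finite values
for i, o in enumerate(out):
    if not torch.isfinite(o).all():
        print(f'output {i}: min={o.min().item()}, max={o.max().item()}')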

I am surprised that not all of them are NaNs; there are some valid outputs. I was suspecting it is due to a batchnorm layer having a very low std, but I am unsure now. Any suggestions on how to debug this?
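A rough sketch of how the BatchNorm running statistics could be inspected (a sketch, not my actual code):

import torch.nn as nn

# print the running statistics of every BatchNorm layer to spot suspiciously small variances
for name, m in model.named_modules():
    if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)) and m.running_var is not None:
        print(name,
              'max |running_mean|:', m.running_mean.abs().max().item(),
              'min running_var:', m.running_var.min().item())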

You could register forward hooks for each module and check the outputs. Once you have isolated the first layer creating the invalid values, you could check its input as well as its parameters to narrow it down further.
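A minimal sketch of such a hook (the function and variable names are just examples):

import torch

def make_nan_hook(name):
    # print a message whenever this module produces non-finite outputs
    def hook(module, inp, out):
        outs = out if isinstance(out, (tuple, list)) else (out,)
        for o in outs:
            if torch.is_tensor(o) and not torch.isfinite(o).all():
                print(f'non-finite values in the output of {name} ({module.__class__.__name__})')
    return hook

handles = [m.register_forward_hook(make_nan_hook(n)) for n, m in model.named_modules()]
# run a single forward pass that reproduces the issue, then remove the hooks:
# for h in handles:
#     h.remove()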