Loss becomes NaN when resuming from a pretrained model

While training resnet18 on ImageNet, I stopped the run at epoch 30. Later, I resumed training from that checkpoint (epoch 30). However, during training the loss always turns to NaN partway through. I have tried three times, and the same problem occurs every time (the loss becomes NaN at a different iteration within epoch 30).
The training code is the official ImageNet example, examples/imagenet/main.py in pytorch/examples on GitHub.
I don't know how to solve this problem. If someone can help me, I would be very grateful!

Could you check the input for invalid values via:

print(torch.isfinite(input), torch.isfinite(target))

Did you change anything else in the code?

The values all show True when using

print(torch.isfinite(input), torch.isfinite(target))

I changed the code slightly, only to record the accuracy and to hide the Data time. I don't think this affects the training process.

And now, when I train from scratch using the official code (without changing anything this time), I always get NaN at iteration 2.

After adding this code:

torch.autograd.set_detect_anomaly(True)
with torch.autograd.detect_anomaly():
    loss.backward()

I got the error message as follows:

(pytorch) shu@hec02:~/GANs/Image Classification/ImageNet$ python main_official.py  /nfs/home/Imagenet_ILSVRC2012
=> creating model 'resnet18'


        ...,

        [[[True, True, True,  ..., True, True, True],
          ...,
          [True, True, True,  ..., True, True, True]]]]) tensor([True, True, True,  ..., True, True, True])
/nfs/home/shu/.local/lib/python3.6/site-packages/torch/autograd/anomaly_mode.py:70: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
  warnings.warn('Anomaly Detection has been enabled. '
Epoch: [0][   0/5005]   Time  5.869 ( 5.869)    Data  4.310 ( 4.310)    Loss 7.0919e+00 (7.0919e+00)    Acc@1   0.78 (  0.78)       Acc@5   0.78 (  0.78)


        ...,

        [[[True, True, True,  ..., True, True, True],
          ...,
          [True, True, True,  ..., True, True, True]]]]) tensor([True, True, True,  ..., True, True, True])
Warning: Error detected in LogSoftmaxBackward. Traceback of forward call that caused the error:
  File "main_official.py", line 439, in <module>
    main()
  File "main_official.py", line 118, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "main_official.py", line 251, in main_worker
    train(train_loader, model, criterion, optimizer, epoch, args)
  File "main_official.py", line 299, in train
    loss = criterion(output, target)
  File "/nfs/home/shu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/nfs/home/shu/.local/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 932, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/nfs/home/shu/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 2317, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/nfs/home/shu/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 1535, in log_softmax
    ret = input.log_softmax(dim)
 (print_stack at /pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:60)
Traceback (most recent call last):
  File "main_official.py", line 439, in <module>
    main()
  File "main_official.py", line 118, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "main_official.py", line 251, in main_worker
    train(train_loader, model, criterion, optimizer, epoch, args)
  File "main_official.py", line 310, in train
    loss.backward()
  File "/nfs/home/shu/.local/lib/python3.6/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/nfs/home/shu/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.

I do not know what is wrong with it.

Thanks for the update. Could you add an .all() operation to the check and also check the output tensor:

print(torch.isfinite(input).all(), torch.isfinite(target).all(), torch.isfinite(output).all())
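For context, a minimal sketch of where such a check could sit in the training loop, stopping at the first non-finite batch; the loop structure and the names train_loader, model, criterion and optimizer are assumed to follow the official ImageNet example and are illustrative only:

for i, (input, target) in enumerate(train_loader):
    input = input.cuda(non_blocking=True)
    target = target.cuda(non_blocking=True)

    output = model(input)
    loss = criterion(output, target)

    # debug check: stop as soon as any of the three tensors contains inf/NaN
    if not (torch.isfinite(input).all() and
            torch.isfinite(target).all() and
            torch.isfinite(output).all()):
        print("non-finite values detected at iteration", i)
        break

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()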

It seems that the loss is getting a NaN value, so I guess that the model might output NaNs.

Did you make any changes to the model? It might have something to do with the final activation layer.

I did not change the model; it is created by the code below:

print("=> creating model '{}'".format(args.arch))
        model = models.__dict__[args.arch]()

Also, this problem only happens sometimes. Right now training seems to work fine, so I will post the error message again if I run into NaN once more. Thanks very much! @ptrblck @ZdsAlpha :slight_smile:

Epoch: [1][2120/5005]   Loss 3.5405e+00 (3.6588e+00)    Acc@1  25.00 ( 25.53)   Acc@5  51.56 ( 49.25)
tensor(True) tensor(True, device='cuda:0') tensor(True, device='cuda:0')
tensor(True) tensor(True, device='cuda:0') tensor(True, device='cuda:0')
tensor(True) tensor(True, device='cuda:0') tensor(True, device='cuda:0')
tensor(True) tensor(True, device='cuda:0') tensor(True, device='cuda:0')
tensor(True) tensor(True, device='cuda:0') tensor(True, device='cuda:0')
tensor(True) tensor(True, device='cuda:0') tensor(True, device='cuda:0')
tensor(True) tensor(True, device='cuda:0') tensor(True, device='cuda:0')
tensor(True) tensor(True, device='cuda:0') tensor(True, device='cuda:0')
tensor(True) tensor(True, device='cuda:0') tensor(True, device='cuda:0')
tensor(True) tensor(True, device='cuda:0') tensor(True, device='cuda:0')
Epoch: [1][2130/5005]   Loss 3.7294e+00 (3.6584e+00)    Acc@1  23.05 ( 25.54)   Acc@5  50.39 ( 49.26)
tensor(True) tensor(True, device='cuda:0') tensor(True, device='cuda:0')
tensor(True) tensor(True, device='cuda:0') tensor(True, device='cuda:0')
tensor(True) tensor(True, device='cuda:0') tensor(True, device='cuda:0')
tensor(True) tensor(True, device='cuda:0') tensor(True, device='cuda:0')
tensor(True) tensor(True, device='cuda:0') tensor(True, device='cuda:0')
tensor(True) tensor(True, device='cuda:0') tensor(True, device='cuda:0')
tensor(True) tensor(True, device='cuda:0') tensor(False, device='cuda:0')
tensor(True) tensor(True, device='cuda:0') tensor(False, device='cuda:0')
tensor(True) tensor(True, device='cuda:0') tensor(False, device='cuda:0')
tensor(True) tensor(True, device='cuda:0') tensor(False, device='cuda:0')
Epoch: [1][2140/5005]   Loss nan (nan)  Acc@1   0.00 ( 25.49)   Acc@5   1.17 ( 49.18)
tensor(True) tensor(True, device='cuda:0') tensor(False, device='cuda:0')
        ... (torch.isfinite(output).all() stays False for every following batch)
Epoch: [1][2150/5005]   Loss nan (nan)  Acc@1   0.00 ( 25.38)   Acc@5   0.78 ( 48.96)
Epoch: [1][2160/5005]   Loss nan (nan)  Acc@1   0.00 ( 25.26)   Acc@5   0.78 ( 48.73)

I got this output after adding that check.

For some reason the output is getting NaN values.
Could you break the training loop after you've encountered the first NaN and check all parameters of the model?
E.g. you could print their abs().max() via:

for name, param in model.named_parameters():
    print(name, param.abs().max())
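
If a parameter itself already contains invalid values, a short check like the following (a sketch, not from the original reply) would point at it directly:

# sketch: flag parameters that already contain inf/NaN values
for name, param in model.named_parameters():
    if not torch.isfinite(param).all():
        print("non-finite values in parameter:", name)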

If this looks alright, you could repeat the last forward iteration (since the input contains valid values) and check all intermediate activations to narrow down which layer creates the NaN outputs, using forward hooks as described here.
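
The linked description is not reproduced here, but a minimal sketch of the forward-hook idea could look like this; the hook and variable names are chosen for illustration, and images stands for the last valid input batch:

def make_hook(name):
    # report every module whose output contains non-finite values
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            print("non-finite activation after module:", name)
    return hook

handles = [module.register_forward_hook(make_hook(name))
           for name, module in model.named_modules()]

with torch.no_grad():
    model(images)          # repeat the forward pass with the last valid batch

for handle in handles:
    handle.remove()        # clean up the hooks afterwards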

Hi @cindybrain, could you please advise whether you found a solution for this? I'm also encountering a similar situation, where the model runs fine on some occasions and produces NaN on others.