NaN values popping up during loss.backward()

I’m using CrossEntropyLoss with a batch size of 4. These are the predictions (logits) and target labels I’m feeding to it, along with the resulting loss value:

preds:
 tensor([[-0.0052,  0.2059, -0.1473],
        [-0.0250,  0.0953,  0.0047],
        [ 0.0684,  0.1638, -0.0705],
        [-0.0195,  0.0100, -0.0874]], device='cuda:0', grad_fn=<AddmmBackward>)
target:
 tensor([2, 2, 2, 2], device='cuda:0')
loss: tensor(1.1942, device='cuda:0', grad_fn=<NllLossBackward>)
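
For what it’s worth, the loss value itself can be reproduced standalone from the tensors above (a minimal sketch, not my actual training code):

import torch
import torch.nn as nn

# Recompute the loss from the printed values above to confirm the
# CrossEntropyLoss step itself is fine.
preds = torch.tensor([[-0.0052,  0.2059, -0.1473],
                      [-0.0250,  0.0953,  0.0047],
                      [ 0.0684,  0.1638, -0.0705],
                      [-0.0195,  0.0100, -0.0874]])
target = torch.tensor([2, 2, 2, 2])

criterion = nn.CrossEntropyLoss()
loss = criterion(preds, target)
print(loss)  # prints tensor(1.1942)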

Here is the error message I’m getting after setting autograd.set_detect_anomaly(True):

Traceback (most recent call last):
  File "run_liunet.py", line 247, in <module>
    main()
  File "run_liunet.py", line 200, in main
    train(model, train_loader, optimizer, device, epoch, 'train', debug_mode)
  File "run_liunet.py", line 80, in train
    loss.backward()
  File "/home/jlko/miniconda3/envs/liuNetEnv/lib/python3.6/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/jlko/miniconda3/envs/liuNetEnv/lib/python3.6/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'NativeBatchNormBackward' returned nan values in its 0th output.
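
For completeness, anomaly detection is switched on once before the training loop (a minimal sketch; the surrounding code in run_liunet.py is of course more involved):

import torch

# Records the forward stack trace of each op and checks every backward
# output for NaNs, which is what produced the NativeBatchNormBackward
# error above.
torch.autograd.set_detect_anomaly(True)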

Here is the architecture of my network: https://raw.githubusercontent.com/Information-Fusion-Lab-Umass/mri-features/9ccf23cd0ecb4f163c76483fed52e4709b02ea4f/liunet.py. I am only using the self.conv and self.fc modules, so everything related to self.age_encoder can be ignored.

Did you make sure that no inputs contain invalid values, e.g. by checking torch.isfinite(input).all()?
Do you see any invalid values in the model output or in the loss before anomaly detection raises the error?
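A quick check like this in the training loop would help narrow it down (the variable names in the usage comments are just placeholders for your tensors):

import torch

def check_finite(name, tensor):
    # Fail fast if the tensor contains NaN or Inf values, so the problem
    # shows up in the forward pass instead of inside backward().
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"Non-finite values found in {name}")

# Usage inside the training loop (data, output, target, loss are placeholders):
# check_finite("input batch", data)
# output = model(data)
# check_finite("model output", output)
# loss = criterion(output, target)
# check_finite("loss", loss)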

Hi ptrblck,

I’m running into a similar problem.
I checked that all of the inputs are finite, but the error persists.

Thanks

Could you post a minimal, executable code snippet which would reproduce this issue, please?
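Something along these lines would be ideal: random tensors standing in for your data and a stripped-down model that still triggers the error (the layers and shapes below are only placeholders, not your actual setup):

import torch
import torch.nn as nn

# Template for a minimal repro: replace the placeholder model and shapes
# with the smallest setup that still raises the NaN error.
torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 3),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

x = torch.randn(4, 1, 32, 32)
target = torch.randint(0, 3, (4,))

torch.autograd.set_detect_anomaly(True)
output = model(x)
loss = criterion(output, target)
loss.backward()
optimizer.step()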