Getting NaNs only on GPU training

I am getting NaNs for the loss after some time, but only when training on the GPU. The same code (with the same config) was working until last month; the CUDA driver and PyTorch were both updated during this period.

The loss increases exponentially with each step, but only on the GPU:

Train Epoch: 1 [0/7146 (0%)] Loss: 0.176097 Train Acc: 0.187500
tensor(0.1642, device='cuda:0', grad_fn=<DivBackward0>)
tensor(8.2784, device='cuda:0', grad_fn=<DivBackward0>)
tensor(21.9132, device='cuda:0', grad_fn=<DivBackward0>)
tensor(51.5612, device='cuda:0', grad_fn=<DivBackward0>)
tensor(415.4600, device='cuda:0', grad_fn=<DivBackward0>)
tensor(17203.3320, device='cuda:0', grad_fn=<DivBackward0>)
tensor(nan, device='cuda:0', grad_fn=<DivBackward0>)

When I enabled autograd anomaly detection, this was the output:
RuntimeError: Function 'CdistBackward' returned nan values in its 0th output.
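For reference, anomaly detection can be enabled like this. This is a minimal sketch, not the original training code: the tensors here are random placeholders, and `torch.cdist` stands in for wherever the distance computation happens in the actual model.

```python
import torch

# With anomaly detection enabled, the backward pass raises an error at the
# first operation that produces NaN and prints the forward-pass traceback
# that created it (this is where the 'CdistBackward' message comes from).
torch.autograd.set_detect_anomaly(True)

x = torch.randn(4, 3, requires_grad=True)
y = torch.randn(5, 3)

# torch.cdist computes pairwise distances; its backward node is CdistBackward.
loss = torch.cdist(x, y).mean()
loss.backward()  # would raise RuntimeError here if backward produced NaN

print(torch.isnan(x.grad).any().item())
```

Note that anomaly detection slows training down considerably, so it is best enabled only while debugging.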

I ran the training on the CPU for 15 hours and it ran perfectly:

Train Epoch: 1 [0/7146 (0%)] Loss: 0.172225 Train Acc: 0.062500
tensor(0.1619, grad_fn=<DivBackward0>)
tensor(0.1835, grad_fn=<DivBackward0>)
tensor(0.2013, grad_fn=<DivBackward0>)
tensor(0.1784, grad_fn=<DivBackward0>)
tensor(0.1805, grad_fn=<DivBackward0>)
tensor(0.1624, grad_fn=<DivBackward0>)
tensor(0.1773, grad_fn=<DivBackward0>)
tensor(0.1839, grad_fn=<DivBackward0>)
tensor(0.1797, grad_fn=<DivBackward0>)
tensor(0.1745, grad_fn=<DivBackward0>)
tensor(0.1901, grad_fn=<DivBackward0>)
tensor(0.1658, grad_fn=<DivBackward0>)
tensor(0.1688, grad_fn=<DivBackward0>)
tensor(0.1681, grad_fn=<DivBackward0>)
tensor(0.1857, grad_fn=<DivBackward0>)
tensor(0.1821, grad_fn=<DivBackward0>)
tensor(0.1936, grad_fn=<DivBackward0>)
tensor(0.1817, grad_fn=<DivBackward0>)
tensor(0.1565, grad_fn=<DivBackward0>)
tensor(0.1748, grad_fn=<DivBackward0>)
tensor(0.1710, grad_fn=<DivBackward0>)
tensor(0.1846, grad_fn=<DivBackward0>)
Train Epoch: 1 [352/7146 (5%)] Loss: 0.184633 Train Acc: 0.062500

Did you change anything else besides the device and do you have a reproducible code snippet?


No, I just changed the device to CPU.
I'll try to create a snippet to reproduce the error.


I downgraded the PyTorch version to 1.1.0 and it started working again on the GPU.
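When comparing behavior across installs like this, it helps to record exactly which build is active. A minimal sketch (not part of the original code):

```python
import torch

# PyTorch build version, e.g. '1.1.0'.
print(torch.__version__)

# CUDA toolkit the wheel was built against; None on CPU-only builds.
print(torch.version.cuda)
```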

Good to hear it's working now, but I'm a bit concerned that you might have discovered a regression in the newer version(s).
Could you try updating to 1.3.0 and running the code again, please?

Sure. I’ll do this by tomorrow.

It works absolutely fine on 1.3.0.

Thanks for checking!