I am getting NaN losses after some time, but only when training on GPU; the same code (with the same config) was working until last month. Both the CUDA driver and PyTorch were updated during this period.
On GPU, the loss grows exponentially with each step until it becomes NaN:
Train Epoch: 1 [0/7146 (0%)] Loss: 0.176097 Train Acc: 0.187500
tensor(0.1642, device='cuda:0', grad_fn=<DivBackward0>)
tensor(8.2784, device='cuda:0', grad_fn=<DivBackward0>)
tensor(21.9132, device='cuda:0', grad_fn=<DivBackward0>)
tensor(51.5612, device='cuda:0', grad_fn=<DivBackward0>)
tensor(415.4600, device='cuda:0', grad_fn=<DivBackward0>)
tensor(17203.3320, device='cuda:0', grad_fn=<DivBackward0>)
tensor(nan, device='cuda:0', grad_fn=<DivBackward0>)
When I enabled autograd anomaly detection (torch.autograd.set_detect_anomaly(True)), this was the output:
RuntimeError: Function 'CdistBackward' returned nan values in its 0th output.
On CPU, the same training ran for 15 hours without any issue:
Train Epoch: 1 [0/7146 (0%)] Loss: 0.172225 Train Acc: 0.062500
tensor(0.1619, grad_fn=<DivBackward0>)
tensor(0.1835, grad_fn=<DivBackward0>)
tensor(0.2013, grad_fn=<DivBackward0>)
tensor(0.1784, grad_fn=<DivBackward0>)
tensor(0.1805, grad_fn=<DivBackward0>)
tensor(0.1624, grad_fn=<DivBackward0>)
tensor(0.1773, grad_fn=<DivBackward0>)
tensor(0.1839, grad_fn=<DivBackward0>)
tensor(0.1797, grad_fn=<DivBackward0>)
tensor(0.1745, grad_fn=<DivBackward0>)
tensor(0.1901, grad_fn=<DivBackward0>)
tensor(0.1658, grad_fn=<DivBackward0>)
tensor(0.1688, grad_fn=<DivBackward0>)
tensor(0.1681, grad_fn=<DivBackward0>)
tensor(0.1857, grad_fn=<DivBackward0>)
tensor(0.1821, grad_fn=<DivBackward0>)
tensor(0.1936, grad_fn=<DivBackward0>)
tensor(0.1817, grad_fn=<DivBackward0>)
tensor(0.1565, grad_fn=<DivBackward0>)
tensor(0.1748, grad_fn=<DivBackward0>)
tensor(0.1710, grad_fn=<DivBackward0>)
tensor(0.1846, grad_fn=<DivBackward0>)
Train Epoch: 1 [352/7146 (5%)] Loss: 0.184633 Train Acc: 0.062500
.....
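Since the anomaly trace points at CdistBackward, one workaround I am considering (in case some pairs of rows passed to torch.cdist coincide) is to replace cdist with a manual pairwise distance that adds a small epsilon under the square root, which keeps the gradient finite at zero distance. This safe_cdist is a sketch of my own, not part of the original code:

```python
import torch

def safe_cdist(x, y, eps=1e-12):
    # Hypothetical replacement for torch.cdist(x, y, p=2):
    # adding eps under the sqrt keeps the gradient finite
    # even when a row of x coincides with a row of y.
    diff = x.unsqueeze(1) - y.unsqueeze(0)   # (n, m, k) pairwise differences
    d2 = diff.pow(2).sum(dim=-1)             # squared Euclidean distances
    return torch.sqrt(d2 + eps)

# Coincident rows no longer produce NaN gradients:
x = torch.zeros(2, 3, requires_grad=True)
safe_cdist(x, torch.zeros(2, 3)).sum().backward()
print(x.grad)  # all zeros, no NaN
```

The trade-off is extra memory for the (n, m, k) difference tensor, but it would at least confirm whether zero distances are what is feeding NaNs into the DivBackward0 step of my loss.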