RuntimeError: Function 'CudnnBatchNormBackward' returned nan values in its 0th output in 5-fold cross-validation

Hi, I am currently working on a CT scan classification task with 5-fold cross-validation. The images go through an online preprocessing stage before entering the backbone to reduce their size, so that the network fits on the GPU. When I train the network from scratch, there are no issues. However, when I use pre-trained weights to fine-tune the backbone, I get an error in the second fold. I am using PyTorch 1.6 with amp. The first fold trains without any problem, but the second one fails. I have tried changing which fold trains first, but the second fold, no matter which one it is, raises this error:

Traceback (most recent call last):
  File "main.py", line 399, in <module>
    main()
  File "main.py", line 190, in main
    bce, scaler)
  File "main.py", line 301, in train
    scaler.scale(loss).backward()
  File "/home/mfroa/anaconda3/envs/lung/lib/python3.7/site-packages/torch/tensor.py", line 184, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/mfroa/anaconda3/envs/lung/lib/python3.7/site-packages/torch/autograd/__init__.py", line 115, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'CudnnBatchNormBackward' returned nan values in its 0th output.
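For reference, the relevant part of my training step looks roughly like this (a simplified sketch: the real preprocessing, backbone and 5-fold loop are omitted, and the model here is just a placeholder):

import torch
from torch.cuda.amp import autocast, GradScaler

device = torch.device("cuda")

# Placeholder network; the real one is a pre-trained backbone containing BatchNorm layers.
model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, kernel_size=3, padding=1),
    torch.nn.BatchNorm2d(8),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(8, 1),
).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
bce = torch.nn.BCEWithLogitsLoss()
scaler = GradScaler()

def train_step(images, targets):
    optimizer.zero_grad()
    with autocast():
        logits = model(images)
        loss = bce(logits.squeeze(1), targets)
    scaler.scale(loss).backward()  # <- the line that raises the RuntimeError
    scaler.step(optimizer)
    scaler.update()
    return loss.item()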


Are you using anomaly detection in your code?
If so, could you disable it while using amp? NaN gradients are expected to occur occasionally with mixed-precision training, especially during the first iterations before the GradScaler has stabilized the scaling factor.
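Since I don't know how it is enabled in your script, this is just the generic way to turn it off; adapt it to wherever you enable it:

import torch

# Anomaly detection is off by default. If it was enabled globally via
# torch.autograd.set_detect_anomaly(True), or locally via a
# torch.autograd.detect_anomaly() context manager around backward(),
# disable it while training with amp: early iterations can legitimately
# produce NaN/Inf gradients that the GradScaler handles by skipping the
# optimizer step.
torch.autograd.set_detect_anomaly(False)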


Hi, I was using anomaly detection. I deactivated it, and it seems to work just fine. Thank you for your help. I was wondering, when is it recommended to use anomaly detection?

You can use it when you are seeing invalid outputs or gradients, so is that the case in your use case?
I mentioned mixed-precision training (torch.cuda.amp) because gradients can overflow there, and the GradScaler then skips those updates. These skipped iterations can nevertheless trigger the anomaly detection, so you should only enable anomaly detection once the loss scaling factor has stabilized.

If you are not using torch.cuda.amp, then something might still be broken in your code, and we should debug it in that case.
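If you want to check when these skipped updates happen (and when the scale factor has stabilized), a rough, self-contained sketch like this would work; the model and data are just dummies:

import torch
from torch.cuda.amp import autocast, GradScaler

device = torch.device("cuda")
model = torch.nn.Linear(16, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler()

for it in range(100):
    data = torch.randn(32, 16, device=device)
    target = torch.randn(32, 1, device=device)
    optimizer.zero_grad()
    with autocast():
        loss = torch.nn.functional.mse_loss(model(data), target)
    scaler.scale(loss).backward()
    # If the gradients contained Inf/NaN, step() is skipped and update()
    # reduces the scale; a decreased scale therefore marks a skipped step.
    scale_before = scaler.get_scale()
    scaler.step(optimizer)
    scaler.update()
    if scaler.get_scale() < scale_before:
        print(f"iter {it}: gradients overflowed, optimizer step was skipped")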


Thank you for your answer. I am using torch.cuda.amp, so I guess the error is related to the first iterations, as you mention. I haven't seen invalid outputs or gradients otherwise, so I don't think that is an issue either.