Training loss is nan on GPU but not on CPU for a CNN

luiejunior · November 17, 2021, 2:20pm

I’m trying to run the train.py file from the GitHub repository gtamba/pytorch-slim-cnn (a PyTorch implementation of SlimCNN for Facial Attribute Classification, https://arxiv.org/abs/1907.02157). When I run it on my GPU, every training loss value is nan. When I run it on CPU, none of the loss values are nan.
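One way to narrow down a CPU/GPU discrepancy like this is to push the same batch through identical copies of the model on both devices and compare the outputs. The sketch below is only an illustration: `max_abs_diff_cpu_vs_gpu`, `toy_model` and `toy_batch` are hypothetical names, and the tiny network is a stand-in rather than the SlimCNN model from the repository.

```python
import copy

import torch
import torch.nn as nn


def max_abs_diff_cpu_vs_gpu(model: nn.Module, batch: torch.Tensor):
    """Return the largest absolute CPU/GPU output difference.

    Returns None when no CUDA device is available, and float("inf")
    when the GPU output contains nan or inf values.
    """
    if not torch.cuda.is_available():
        return None
    # Deep copies keep the caller's model on its original device.
    cpu_model = copy.deepcopy(model).cpu().eval()
    gpu_model = copy.deepcopy(model).cuda().eval()
    with torch.no_grad():
        cpu_out = cpu_model(batch.cpu())
        gpu_out = gpu_model(batch.cuda()).cpu()
    if not torch.isfinite(gpu_out).all():
        return float("inf")
    return (cpu_out - gpu_out).abs().max().item()


# Hypothetical stand-in for the real network and a real data batch.
toy_model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(4, 2),
)
toy_batch = torch.randn(2, 3, 8, 8)

diff = max_abs_diff_cpu_vs_gpu(toy_model, toy_batch)
print("max abs difference:", diff)
```

If the forward outputs already disagree badly, the problem is in the forward pass on the GPU. If they agree, the nan values are more likely to appear in the loss or the backward pass.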

When I add the line torch.autograd.set_detect_anomaly(True), I get the error message "RuntimeError: Function 'BinaryCrossEntropyWithLogitsBackward0' returned nan values in its 0th output."
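Since the anomaly report points at the backward pass of the loss, it may help to check whether the logits and targets going into it are already non-finite. This is a minimal sketch, assuming the training script uses `nn.BCEWithLogitsLoss` (suggested by the error name, but not confirmed here). `assert_finite` is a hypothetical helper and the tensors are made-up example values.

```python
import torch
import torch.nn as nn


def assert_finite(name: str, tensor: torch.Tensor) -> None:
    """Raise ValueError if the tensor contains any nan or inf entries."""
    finite = torch.isfinite(tensor)
    if not finite.all():
        bad = (~finite).sum().item()
        raise ValueError(f"{name} contains {bad} non-finite value(s)")


criterion = nn.BCEWithLogitsLoss()

# Made-up example values standing in for model outputs and labels.
logits = torch.tensor([[0.5, -1.0], [2.0, 0.0]], requires_grad=True)
targets = torch.tensor([[1.0, 0.0], [0.0, 1.0]])

# Check the inputs to the loss before computing it.
assert_finite("logits", logits)
assert_finite("targets", targets)

loss = criterion(logits, targets)
loss.backward()

# Check the gradient that flows back into the model.
assert_finite("logits.grad", logits.grad)
print("loss:", loss.item())
```

In a real training loop the same checks would go right before and after the loss call, with the tensors moved to the GPU, so the first failing check shows where the nan values first appear.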

What seems to be the issue here, and how can I fix it? I have an NVIDIA RTX 3070 GPU and am on CUDA version 11.4. I’m using torch version 1.10.0. Let me know if there’s any other information I can provide to help diagnose the problem.
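For anyone reporting a similar problem, the version details above can be collected from inside Python. The snippet is a small sketch, not an official diagnostic tool, and `describe_environment` is a hypothetical helper name. Note that `torch.version.cuda` is the CUDA version PyTorch was built against, which may differ from the version reported by the driver.

```python
import torch


def describe_environment() -> dict:
    """Collect the version details that are useful in a bug report."""
    info = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "cudnn": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
    }
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


env = describe_environment()
for key, value in env.items():
    print(f"{key}: {value}")
```

Running `python -m torch.utils.collect_env` gives a more complete report that can be pasted into a post.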

Thanks!

illtellyoulater · March 11, 2022, 2:44pm

Hello,
did you figure out what was causing this problem?
I’m seeing the same issue on a GTX 1660 Ti GPU, but it automagically disappears when I use a GTX 1050.
Any help would be appreciated.
Thank you.