Conv2d output is NaN on GPU, but not on CPU

I’ve been working on a project with a collaborator and we’ve been trying to train a large U-Net model (~800k parameters). When I trained the model on my GPU, the loss came out as NaN. After sifting through possible causes, I found that my activations started off as well-distributed, normalized numbers, but an upsampling followed by a 2D convolution turned some of them into NaN. These cascaded through the rest of the model and turned the entire output, the gradients, and the loss into NaN. This doesn’t happen on my CPU, however.
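For reference, this is roughly how I narrowed it down: I registered a forward hook on every module and printed the name of any module whose output contained NaNs (a simplified sketch; model stands in for my actual U-Net):

import torch

# Simplified sketch: report every module whose output contains NaNs.
# `model` stands in for my actual U-Net.
def make_nan_hook(name):
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and output.isnan().any():
            print(f"NaN in output of {name} ({module.__class__.__name__})")
    return hook

for name, module in model.named_modules():
    module.register_forward_hook(make_nan_hook(name))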

I managed to distill the error to just this:

>>> import torch
>>> import torch.nn as nn
>>>
>>> x = torch.rand((32, 64, 16, 16)).to(torch.float32).to("cpu")
>>> conv = nn.Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)).to(torch.float32).to("cpu")
>>>
>>> with torch.no_grad():
...     output = conv(x)
...
>>> print(output.isnan().any())
tensor(False)
>>> x = torch.rand((32, 64, 16, 16)).to(torch.float32).to("cuda")
>>> conv = nn.Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)).to(torch.float32).to("cuda")
>>>
>>> with torch.no_grad():
...     output = conv(x)
...
>>> print(output.isnan().any())
tensor(True, device='cuda:0')

The output tensor from the GPU run, with shape torch.Size([32, 64, 16, 16]), looks like this:

tensor([[[[    nan,     nan,  0.0310,  ...,     nan,  0.0310,  0.0310],
          [    nan,  0.0310,     nan,  ...,  0.0310,     nan,  0.0310],
          [    nan,  0.0310,  0.0310,  ...,  0.0310,     nan,  0.0310],
          ...,
          [    nan,  0.0310,     nan,  ...,     nan,     nan,  0.0310],
          [    nan,  0.0310,  0.0310,  ...,  0.0310,     nan,  0.0310],
          [ 0.0310,     nan,  0.0310,  ...,     nan,  0.0310,  0.0310]],

         [[-0.0250, -0.0250, -0.0250,  ..., -0.0250, -0.0250, -0.0250],
          [-0.0250, -0.0250, -0.0250,  ..., -0.0250, -0.0250, -0.0250],
          [-0.0250, -0.0250, -0.0250,  ..., -0.0250, -0.0250, -0.0250],
          ...,
          [-0.0250, -0.0250, -0.0250,  ..., -0.0250, -0.0250, -0.0250],
          [-0.0250, -0.0250, -0.0250,  ..., -0.0250, -0.0250, -0.0250],
          [-0.0250, -0.0250, -0.0250,  ..., -0.0250, -0.0250, -0.0250]],

         [[    nan,     nan,  0.0058,  ...,  0.0058,     nan,     nan],
          [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
          [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
          ...,
          [ 0.0058,  0.0058,  0.0058,  ...,     nan,  0.0058,     nan],
          [    nan,     nan,  0.0058,  ...,  0.0058,     nan,     nan],
          [    nan,     nan,     nan,  ...,     nan,     nan,     nan]],

         ...,
...
          ...,
          [ 0.0051,  0.0051,  0.0051,  ...,  0.0051,  0.0051,  0.0051],
          [ 0.0051,  0.0051,  0.0051,  ...,  0.0051,  0.0051,  0.0051],
          [ 0.0051,  0.0051,  0.0051,  ...,  0.0051,  0.0051,  0.0051]]]],
       device='cuda:0', dtype=torch.float16)

To me there seem to be a surprisingly large number of repeated values and NaNs.
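The repetition can be quantified directly; something like this shows how widespread the NaNs are and how many distinct values remain:

# fraction of entries that are NaN
print(output.isnan().float().mean())
# number of distinct non-NaN values in the whole tensor
print(output[~output.isnan()].unique().numel())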

My GPU is an NVIDIA GeForce GTX 1660 Ti and I’m running this through WSL2. I’m using PyTorch 11.7 and CUDA 11.5.

If anyone has any ideas about what might be causing this and how to resolve it, I’d love to hear them. I’m still relatively new to deep learning, so this is turning out to be quite the challenge for me.

PyTorch 11.7 is not a thing, so could you check which version you have installed? If it’s an older one, please update to the latest release and rerun your code.
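You can check both directly from Python:

import torch

print(torch.__version__)              # installed PyTorch release
print(torch.version.cuda)             # CUDA version the binaries were built with
print(torch.cuda.get_device_name(0))  # the GPU PyTorch actually sees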

You’re right, I made a mistake. I checked again and I’m using PyTorch 2.0.1, which I believe expects CUDA 11.7 (torch.version.cuda outputs 11.7), while nvcc --version in my terminal reported CUDA 11.5. I tried upgrading my CUDA toolkit to 11.7, but the error persisted.

When I upgraded to torch 2.4.0 (latest at the time of posting), the small test no longer produces any NaN values:

>>> x = torch.rand((32, 64, 16, 16)).to(torch.float32).to("cuda")
>>> conv = nn.Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)).to(torch.float32).to("cuda")
>>>
>>> with torch.no_grad():
...     output = conv(x)
...
>>> print(output.isnan().any())
tensor(False, device='cuda:0')

However, training my model depends on a package that is incompatible with torch 2.4.0:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
miniai 0.0.1 requires torch<2.1,>=1.7, but you have torch 2.4.0 which is incompatible.

Do you think there are any other possible solutions?

Thanks for confirming that no issues are seen in the latest PyTorch release.
Do you know why torch<2.1 is needed and what exactly the third-party library does not support in newer PyTorch releases?

I’m not exactly sure why torch<2.1 is required by the third-party library. I changed the library’s requirement to torch<=2.4 and had to modify a few other things (adding TrainCB() to the Learner cbs), but it seems to work fine now. I appreciate you taking the time to help!
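For anyone hitting the same pin: besides editing the package’s requirements as I did, another option would be to skip its dependency resolution entirely and manage torch yourself, e.g.:

# install the package without its pinned dependencies,
# then install the torch version you actually want
pip install --no-deps miniai
pip install "torch==2.4.0"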