I’ve been working on a project with a collaborator lately, and we’ve been trying to train a large UNet model (~800k params). When I tried to train the model on my GPU, the loss became nan. After sifting through possible issues, I found that my activations started out as well-distributed, normalized numbers, but at some point an upsampling followed by a 2D convolution caused some of them to become nan. The nans then cascaded through the rest of the model and turned the entire output, the gradients, and the loss into nan. This didn’t happen on the CPU, however.
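(For anyone curious, this is roughly how I tracked down where the nans first appear, using PyTorch forward hooks; it’s a sketch rather than my exact code, and the helper name is just for illustration.)

import torch
import torch.nn as nn

def report_nan_outputs(model: nn.Module):
    # Attach a forward hook to every leaf module that prints the
    # module's name whenever its output contains a nan.
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and output.isnan().any():
                print(f"nan in output of: {name} ({module.__class__.__name__})")
        return hook

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            module.register_forward_hook(make_hook(name))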
I managed to distill the error to just this:
>>> x = torch.rand((32, 64, 16, 16)).to(torch.float32).to("cpu")
>>> conv = nn.Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)).to(torch.float32).to("cpu")
>>> with torch.no_grad():
...     output = conv(x)
...
>>> print(output.isnan().any())
tensor(False)
>>> x = torch.rand((32, 64, 16, 16)).to(torch.float32).to("cuda")
>>> conv = nn.Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)).to(torch.float32).to("cuda")
>>> with torch.no_grad():
...     output = conv(x)
...
>>> print(output.isnan().any())
tensor(True, device='cuda:0')
The output tensor from the GPU has a shape of torch.Size([32, 64, 16, 16]) and looks like this:
tensor([[[[    nan,     nan,  0.0310,  ...,     nan,  0.0310,  0.0310],
          [    nan,  0.0310,     nan,  ...,  0.0310,     nan,  0.0310],
          [    nan,  0.0310,  0.0310,  ...,  0.0310,     nan,  0.0310],
          ...,
          [    nan,  0.0310,     nan,  ...,     nan,     nan,  0.0310],
          [    nan,  0.0310,  0.0310,  ...,  0.0310,     nan,  0.0310],
          [ 0.0310,     nan,  0.0310,  ...,     nan,  0.0310,  0.0310]],

         [[-0.0250, -0.0250, -0.0250,  ..., -0.0250, -0.0250, -0.0250],
          [-0.0250, -0.0250, -0.0250,  ..., -0.0250, -0.0250, -0.0250],
          [-0.0250, -0.0250, -0.0250,  ..., -0.0250, -0.0250, -0.0250],
          ...,
          [-0.0250, -0.0250, -0.0250,  ..., -0.0250, -0.0250, -0.0250],
          [-0.0250, -0.0250, -0.0250,  ..., -0.0250, -0.0250, -0.0250],
          [-0.0250, -0.0250, -0.0250,  ..., -0.0250, -0.0250, -0.0250]],

         [[    nan,     nan,  0.0058,  ...,  0.0058,     nan,     nan],
          [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
          [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
          ...,
          [ 0.0058,  0.0058,  0.0058,  ...,     nan,  0.0058,     nan],
          [    nan,     nan,  0.0058,  ...,  0.0058,     nan,     nan],
          [    nan,     nan,     nan,  ...,     nan,     nan,     nan]],

         ...,
         ...
         ...,
          [ 0.0051,  0.0051,  0.0051,  ...,  0.0051,  0.0051,  0.0051],
          [ 0.0051,  0.0051,  0.0051,  ...,  0.0051,  0.0051,  0.0051],
          [ 0.0051,  0.0051,  0.0051,  ...,  0.0051,  0.0051,  0.0051]]]],
       device='cuda:0', dtype=torch.float16)
To me it seems there are a surprising number of repeated values and nans.
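(For reference, this is roughly how I eyeballed that, with standard tensor ops; the channel index is just illustrative.)

>>> output.isnan().float().mean()  # fraction of entries that came out nan
>>> output[0, 1].unique()          # a channel that appears to have collapsed to a single value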
My GPU is an NVIDIA GeForce GTX 1660 Ti and I’m running this through WSL2, with PyTorch 11.7 and CUDA 11.5.
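For completeness, these are the standard calls I’m using to read the versions on my end (in case I’ve mixed something up):

>>> import torch
>>> torch.__version__
>>> torch.version.cuda
>>> torch.backends.cudnn.version()
>>> torch.cuda.get_device_name(0)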
If anyone has any ideas about what might be causing this issue and how to resolve it, I’d love to hear them. I’m still relatively new to deep learning, so this is turning out to be quite the challenge for me.