nn.Conv2d produces NaNs when a specific out_channels value is set

I am trying to implement fp16 training for a VQ-GAN. The issue is: NaNs appear in the first batch when using the gradient scaler. I tracked the issue down to a specific Conv2d layer. I also saved the input coming into it, which can be downloaded here:

Code to reproduce NaNs:

import torch
from torch.nn import GroupNorm, SiLU, Conv2d

# input tensor saved right before the failing layer (see link above)
inf = torch.load("inf.pt").cuda()


gn1 = GroupNorm(16, 128, eps=1e-05, affine=True).cuda().half()
s = SiLU()
# any out_channels value in [64, 73] produces NaNs; values outside that range are fine
conv1 = Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)).cuda().half()

o_gn1 = gn1(inf)
o_s = s(o_gn1)
conv1_out = conv1(o_s)
conv1_out.min(), conv1_out.max()

(tensor(nan, device='cuda:0', dtype=torch.float16, grad_fn=<MinBackward1>),
 tensor(nan, device='cuda:0', dtype=torch.float16, grad_fn=<MaxBackward1>))

The weird part is that when I set out_channels to anything outside the [64, 73] range, the NaNs suddenly disappear. Also, when I switch back to fp32, the problem does not occur. Am I missing something important? I would appreciate any help, thanks.
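To confirm the range, here is a minimal sketch (assuming the same inf.pt input as above) that scans out_channels values and reports which ones produce NaNs:

import torch
from torch.nn import GroupNorm, SiLU, Conv2d

inf = torch.load("inf.pt").cuda()
gn1 = GroupNorm(16, 128, eps=1e-05, affine=True).cuda().half()
s = SiLU()

# scan out_channels and report which values give NaNs in the conv output
for out_channels in range(60, 80):
    conv1 = Conv2d(128, out_channels, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)).cuda().half()
    out = conv1(s(gn1(inf)))
    if torch.isnan(out).any():
        print(f"out_channels={out_channels}: NaNs")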

I'm running the experiments on:
CUDA: cuda_11.7
Torch: 2.0.1

What's the range of the input as well as the outputs? I guess it could overflow the valid FP16 range.
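A quick way to collect those stats (a minimal sketch, assuming the o_s and conv1_out tensors from the snippet above) would be something like:

def tensor_stats(name, t):
    # cast to fp32 for the reductions and print what matters for FP16 overflow checks (FP16 max is 65504)
    t = t.float()
    print(f"{name}: min={t.min().item():.4f} max={t.max().item():.4f} "
          f"std={t.std().item():.4f} NaNs={torch.isnan(t).sum().item()}")

tensor_stats("input after SiLU", o_s)
tensor_stats("conv1 output", conv1_out)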

That is what I initially thought, but the variance inside the autoencoder is well-behaved.
Input to conv1 after SiLU:

min: -0.2785
max: 4.5039
std: 0.5737
NaNs: 0

Now the stats of conv1.weight:

min: -0.02946
max: 0.02946
std: 0.01705

If I increase, decrease, or otherwise change the range of conv1.weight, it still produces NaNs, but only when out_channels is in the [64, 73] range.
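A possible workaround for now (just a sketch, not a fix for the underlying kernel issue, and reusing the layer names from the snippet above) would be to keep only this conv in fp32 and cast around it:

# keep conv1 in fp32 and cast its input up / output back down to half
conv1_fp32 = Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)).cuda()  # no .half()
conv1_out = conv1_fp32(o_s.float()).half()
print(conv1_out.min(), conv1_out.max())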

Are you seeing the same issue after updating PyTorch to the latest stable or nightly release?
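To confirm which build is actually in use after the upgrade, you could print the version info (a minimal check):

import torch
print(torch.__version__)               # PyTorch build
print(torch.version.cuda)              # CUDA version PyTorch was compiled against
print(torch.backends.cudnn.version())  # cuDNN version used for the conv kernels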

Oh, that is weird: when I switch to a newer version of PyTorch on Google Colab, training continues without any issues. Thanks!

Great! Thanks for checking your code with a newer release and confirming it’s working now.
