RuntimeError: Function 'CudnnBatchNormBackward' returned nan values in its 0th output

I am trying to train a convolutional variational autoencoder on greyscale images, but I am getting poor/no results.
I use a discriminator to try to force the latent-space representation into a sensible distribution.

After analyzing the gradients I found that they become NaN after a few batches.
So I tried debugging with torch.autograd.set_detect_anomaly(True),
and I get the error in the title; it is triggered when calling loss.backward().
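For reference, anomaly detection can be enabled like this (a minimal sketch with a toy stand-in model, not my actual network):

```python
import torch

# Enable anomaly detection: backward() will then report which autograd
# function produced the NaN, at the cost of slower training.
torch.autograd.set_detect_anomaly(True)

# Toy model standing in for the real one.
model = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.BatchNorm1d(2))
x = torch.randn(8, 4)
loss = model(x).sum()
loss.backward()  # raises RuntimeError here if any backward op produces NaN
```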

I checked the batch of inputs and there are no NaN values; also, my batch size is greater than 1, so there shouldn’t be a problem with the normalization of the batches.

My model architecture is:

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [4, 8, 254, 254]              80
       BatchNorm2d-2           [4, 8, 254, 254]              16
              ReLU-3           [4, 8, 254, 254]               0
            Conv2d-4           [4, 8, 127, 127]             584
       BatchNorm2d-5           [4, 8, 127, 127]              16
              ReLU-6           [4, 8, 127, 127]               0
            Conv2d-7           [4, 8, 127, 127]             584
         ConvBlock-8           [4, 8, 127, 127]               0
            Conv2d-9           [4, 8, 127, 127]              64
     ResDownBlock-10           [4, 8, 127, 127]               0
      BatchNorm2d-11           [4, 8, 127, 127]              16
             ReLU-12           [4, 8, 127, 127]               0
           Conv2d-13            [4, 16, 64, 64]           1,168
      BatchNorm2d-14            [4, 16, 64, 64]              32
             ReLU-15            [4, 16, 64, 64]               0
           Conv2d-16            [4, 16, 64, 64]           2,320
        ConvBlock-17            [4, 16, 64, 64]               0
           Conv2d-18            [4, 16, 64, 64]             128
     ResDownBlock-19            [4, 16, 64, 64]               0
           Linear-20                   [4, 200]      13,107,400
           Linear-21                   [4, 200]      13,107,400
================================================================
Total params: 26,219,808
Trainable params: 26,219,808
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 1.00
Forward/backward pass size (MB): 96.70
Params size (MB): 100.02
Estimated Total Size (MB): 197.73
----------------------------------------------------------------
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Linear-1                 [4, 65536]      13,172,736
       BatchNorm2d-2          [4, 16, 128, 128]              32
              ReLU-3          [4, 16, 128, 128]               0
            Conv2d-4           [4, 8, 128, 128]           1,160
       BatchNorm2d-5           [4, 8, 128, 128]              16
              ReLU-6           [4, 8, 128, 128]               0
            Conv2d-7           [4, 8, 128, 128]             584
         ConvBlock-8           [4, 8, 128, 128]               0
            Conv2d-9           [4, 8, 128, 128]             128
       ResUpBlock-10           [4, 8, 128, 128]               0
      BatchNorm2d-11           [4, 8, 256, 256]              16
             ReLU-12           [4, 8, 256, 256]               0
           Conv2d-13           [4, 1, 256, 256]              73
      BatchNorm2d-14           [4, 1, 256, 256]               2
             ReLU-15           [4, 1, 256, 256]               0
           Conv2d-16           [4, 1, 256, 256]              10
        ConvBlock-17           [4, 1, 256, 256]               0
           Conv2d-18           [4, 1, 256, 256]               8
       ResUpBlock-19           [4, 1, 256, 256]               0
          Sigmoid-20           [4, 1, 256, 256]               0
================================================================
Total params: 13,174,765
Trainable params: 13,174,765
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 94.00
Params size (MB): 50.26
Estimated Total Size (MB): 144.26
----------------------------------------------------------------
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
       BatchNorm1d-1                   [4, 200]             400
              ReLU-2                   [4, 200]               0
            Linear-3                   [4, 128]          25,728
       LinearBlock-4                   [4, 128]               0
       BatchNorm1d-5                   [4, 128]             256
              ReLU-6                   [4, 128]               0
            Linear-7                     [4, 1]             129
       LinearBlock-8                     [4, 1]               0
           Sigmoid-9                     [4, 1]               0
================================================================
Total params: 26,513
Trainable params: 26,513
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.03
Params size (MB): 0.10
Estimated Total Size (MB): 0.13
----------------------------------------------------------------

Hi,

I think there is a confusion in your question: the error says nan, not none.
Also, nan can appear in batchnorm when all the values in the batch are the same, and thus std = 0.

Yes, nan is what I meant. What can I do about the input values being the same?
The data loader is set to shuffle and I would like to keep it that way.

Well, the batchnorm layer does not really make sense if all inputs are the same. Beyond the fact that you get nan gradients, all the values that you forward are 0s. Is that expected in your case?
Also, if all these values are 0s, that means that all the gradients flowing back will be 0s as well. So you won’t be able to learn anything that runs before the batchnorm.
If that’s what you want, you can simply set everything before the batchnorm not to require gradients.
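A minimal sketch of this degenerate case (toy tensors, not the original data): a batch of identical samples has zero variance per channel, so BatchNorm maps the whole batch to zeros.

```python
import torch

bn = torch.nn.BatchNorm2d(3)
x = torch.ones(4, 3, 8, 8)   # four identical samples -> per-channel variance is 0
y = bn(x)

print(x.var(dim=(0, 2, 3)))  # tensor([0., 0., 0.])
print(y.abs().max())         # 0: the normalized output collapses to zeros
```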

The inputs shouldn’t be all the same; they are images of shoe-sole impressions, which are mostly white with a bit of black where the shoe sole came in contact with the ground.
For the first few batches (most of the time just 1–2, but sometimes 3 or 4) it seems to work as intended.
So maybe there is a problem with the DataLoader? But how can I debug this other than stepping through it? Is there a way to get the current statistics of the batchnorm layers?

In your forward function, you can add extra prints to check the values of the current batch (potentially with if checks to only print for problematic batches).
You can check the max difference between the different samples in your batch for example.
You can also save the batch on disk at every iteration. That way, when it crashes, you can load the last saved batch to inspect the values for it.
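Those checks could be sketched like this (a hypothetical `debug_batch` helper, assuming batches are `[N, C, H, W]` tensors):

```python
import torch

def debug_batch(batch, path="last_batch.pt"):
    # Flatten each sample to a vector so samples can be compared elementwise.
    flat = batch.flatten(start_dim=1)
    # Largest difference between any two samples in the batch; ~0 means the
    # samples are (nearly) identical and BatchNorm will misbehave.
    spread = (flat.max(dim=0).values - flat.min(dim=0).values).max()
    if spread < 1e-6 or torch.isnan(batch).any():
        print(f"suspicious batch: spread={spread.item():.3e}")
    # Overwrite every step; after a crash, reload the file to inspect the
    # last batch that was fed to the model.
    torch.save(batch, path)
    return spread

debug_batch(torch.randn(4, 1, 16, 16))
```

The running statistics of a BatchNorm layer can also be inspected directly via its `running_mean` and `running_var` buffers.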

I’m also trying to train a VAE, and I get the same issue.
In my case it happened that an element of the KL cost equals inf; that’s when I get this error. Still haven’t found a solution!

You will need to make sure the KL does not go to infinity to remove this error.
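One common mitigation (an assumption on my side, not something specific to the poster's model) is to clamp the predicted log-variance before it enters the Gaussian-prior KL term, so exp(logvar) cannot overflow to inf:

```python
import torch

def kl_divergence(mu, logvar):
    # Standard Gaussian-prior KL term of a VAE:
    #   KL = -0.5 * sum(1 + logvar - mu^2 - exp(logvar))
    # Clamping logvar keeps exp(logvar) bounded even if the encoder
    # momentarily predicts extreme values.
    logvar = logvar.clamp(min=-10.0, max=10.0)
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

kl = kl_divergence(torch.zeros(4, 200), torch.zeros(4, 200))
```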

Hello, has your problem been solved? I have a similar problem…very urgent

Sorry, my error was due to a dumb mishap on my side: I had misconfigured the network structure so that multiple conv layers followed each other without any activation function in between.