NaN weights on Conv2d layer, RuntimeError: Function 'CudnnBatchNormBackward0' returned nan values in its 0th output

Hi there,

I’m currently implementing a model containing two VAEs, one running on MNIST and the other on SVHN.
During training, the KLD loss turns into NaN after some iterations. I tried lowering the learning rate, which seemed successful at first glance, but I now face the same situation after a few epochs.

After some investigation, I observed that this comes from the weights of the VAEs’ encoder layers becoming NaN at some point. I enabled torch.autograd.set_detect_anomaly(True) at the beginning of the script, which resulted in this error:

RuntimeError: Function 'CudnnBatchNormBackward0' returned nan values in its 0th output.

This seems to be triggered by batch_norm in my VAEs’ encoders, but I do not understand what causes this error (there are no NaN values in the inputs, as these are the standard MNIST and SVHN datasets).
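For reference, this is roughly how I enable anomaly detection and sanity-check the batches (a minimal sketch; I’m assuming here that the loader yields the two image batches as imagesA and imagesB, as in my training loop):

import torch

# enable anomaly detection once, at the top of the script
torch.autograd.set_detect_anomaly(True)

# sanity check: assert that the raw batches themselves are finite
for imagesA, imagesB in train_loader:
    assert torch.isfinite(imagesA).all(), "NaN/Inf in MNIST batch"
    assert torch.isfinite(imagesB).all(), "NaN/Inf in SVHN batch"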

Thank you in advance for your answer!

Edit: I also added gradient clipping, to no avail. The error is also unrelated to the CUDA toolkit version: it happens with both 10.2.89 and 11.3.1.
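(The clipping I tried looks roughly like this; the max_norm value is just an arbitrary illustration.)

import torch

loss.backward()
# clip the global gradient norm before the optimizer step (max_norm=1.0 is arbitrary here)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()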

If your loss diverges to NaN values, the gradients will also be computed as NaNs, and anomaly detection will then flag the error in whichever layer happens to return them.
You would have to make sure your training doesn’t diverge and the loss contains valid values.
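A quick sanity check (just a sketch) would be to guard the backward call and stop as soon as the loss becomes invalid:

import torch

if not torch.isfinite(loss):
    raise RuntimeError(f"non-finite loss detected: {loss.item()}")
loss.backward()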

Hi @ptrblck, thank you for your answer.

That’s the thing: when I inspect the training behaviour, the error is thrown before the loss goes NaN. I dropped into Pdb at the moment the error is raised (with a try: loss.backward() except: pdb.set_trace()), and all model parameters, grads and loss terms are well defined…
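In code, the guard looks roughly like this (a sketch of what I described above):

import pdb

try:
    loss.backward()
except RuntimeError:
    # inspect model parameters, their .grad, and the individual loss terms here
    pdb.set_trace()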

The whole traceback I get is:

/gpfs/users/pellegrainv/.conda/envs/a100/lib/python3.9/site-packages/torch/autograd/__init__.py:173: UserWarning: Error detected in CudnnBatchNormBackward0. Traceback of forward call that caused the error:
  File "/gpfs/workdir/pellegrainv/packages/dmvae/train_dmvae.py", line 265, in <module>
    model = model_pipeline(args)
  File "/gpfs/workdir/pellegrainv/packages/dmvae/train_dmvae.py", line 21, in model_pipeline
    train_losses, train_epoch_losses, test_losses = train_epochs(model, train_loader, test_loader, optimizer, config,
  File "/gpfs/workdir/pellegrainv/packages/dmvae/train_dmvae.py", line 172, in train_epochs
    train_loss, train_epoch_loss = train(model, train_loader, optimizer, epoch, config, quiet)
  File "/gpfs/workdir/pellegrainv/packages/dmvae/train_dmvae.py", line 63, in train
    recons, mu, log_var = model(imagesA, imagesB)
  File "/gpfs/users/pellegrainv/.conda/envs/a100/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/gpfs/workdir/pellegrainv/packages/dmvae/models/dmvae.py", line 25, in forward
    mu_prA, log_var_prA = self.vaeA.encode(inputA)
  File "/gpfs/workdir/pellegrainv/packages/dmvae/models/vanilla_vae.py", line 139, in encode
    result = self.encoder(input)
  File "/gpfs/users/pellegrainv/.conda/envs/a100/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs/users/pellegrainv/.conda/envs/a100/lib/python3.9/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/gpfs/users/pellegrainv/.conda/envs/a100/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs/users/pellegrainv/.conda/envs/a100/lib/python3.9/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/gpfs/users/pellegrainv/.conda/envs/a100/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs/users/pellegrainv/.conda/envs/a100/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 168, in forward
    return F.batch_norm(
  File "/gpfs/users/pellegrainv/.conda/envs/a100/lib/python3.9/site-packages/torch/nn/functional.py", line 2421, in batch_norm
    return torch.batch_norm(
 (Triggered internally at  /opt/conda/conda-bld/pytorch_1646756402876/work/torch/csrc/autograd/python_anomaly_mode.cpp:104.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass

Thank you in advance for your time.

That’s interesting. Could you store the inputs to the batchnorm layer as well as its state and try to check what the output was?

I uploaded the log file (pickle) containing: input, output, weight, bias, running mean, and running var for all 5 batchnorm2d layers in my VAE encoder. It can be downloaded here:
https://drive.google.com/file/d/1vfv26QwMeqnfA6MEme4EnbkBYCJFfebq/view?usp=sharing

I don’t know if that’s convenient for you, but it was too big to post here. On my side, I checked the min and max values of all these parameters/inputs/outputs, and they seemed to be fine.
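For completeness, the state was captured roughly like this (a sketch; the hook layout, the bn_logs.pkl filename and logging every BatchNorm2d of one encoder are just my choices):

import pickle
import torch.nn as nn

logs = {}

def make_forward_hook(name):
    def hook(module, inputs, output):
        logs[name] = {
            "input": inputs[0].detach().cpu(),
            "output": output.detach().cpu(),
            "weight": module.weight.detach().cpu(),
            "bias": module.bias.detach().cpu(),
            "running_mean": module.running_mean.cpu(),
            "running_var": module.running_var.cpu(),
        }
    return hook

# register a forward hook on every BatchNorm2d layer of the encoder
for name, module in model.vaeA.encoder.named_modules():
    if isinstance(module, nn.BatchNorm2d):
        module.register_forward_hook(make_forward_hook(name))

# ... after running the forward/backward pass, dump everything to disk
with open("bn_logs.pkl", "wb") as f:
    pickle.dump(logs, f)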

Thanks.

Edit: I also stored the grad_input for these 5 batchnorm2d layers. At the time the error is thrown, I can observe that the grad_inputs of the last batchnorm2d layer are all -inf.
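The grad_inputs were captured with backward hooks, roughly like this (a sketch; register_full_backward_hook requires a reasonably recent PyTorch version):

import torch.nn as nn

grad_logs = {}

def make_backward_hook(name):
    def hook(module, grad_input, grad_output):
        # grad_input is a tuple; for the last BatchNorm2d layer these entries come out as -inf
        grad_logs[name] = [g.detach().cpu() if g is not None else None for g in grad_input]
    return hook

for name, module in model.vaeA.encoder.named_modules():
    if isinstance(module, nn.BatchNorm2d):
        module.register_full_backward_hook(make_backward_hook(name))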