NaN weights on Conv2d layer, RuntimeError: Function 'CudnnBatchNormBackward0' returned nan values in its 0th output

Hi there,

I’m currently implementing a model containing two VAEs, one running on MNIST and the other on SVHN.
During training, the KLD loss turns into NaN after some iterations. I tried lowering the learning rate, which seemed successful at first glance, but I now face the same situation after a few epochs.

After some investigation, I observed that this comes from the weights of the VAEs’ encoder layers becoming NaN at some point. I enabled torch.autograd.set_detect_anomaly(True) at the beginning of the script, which resulted in this error:

RuntimeError: Function 'CudnnBatchNormBackward0' returned nan values in its 0th output.

This seems to be triggered by batch_norm in my VAEs’ encoders, but I do not understand what causes this error (there are no NaN values in the inputs, as these are the standard MNIST and SVHN datasets).
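For reference, this is roughly how I enable anomaly detection and sanity-check the batches (a minimal sketch; I’m assuming here that the loader yields the two image batches as imagesA and imagesB, as in my training loop):

import torch

# enable anomaly detection once, at the top of the script
torch.autograd.set_detect_anomaly(True)

# sanity check: assert that the raw batches themselves are finite
for imagesA, imagesB in train_loader:
    assert torch.isfinite(imagesA).all(), "NaN/Inf in MNIST batch"
    assert torch.isfinite(imagesB).all(), "NaN/Inf in SVHN batch"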

Thank you in advance for your answer!

Edit: I also added gradient clipping, to no avail. The error is also unrelated to the CUDA toolkit version: it happens with both 10.2.89 and 11.3.1.
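(The clipping I tried looks roughly like this; the max_norm value is just an arbitrary illustration.)

import torch

loss.backward()
# clip the global gradient norm before the optimizer step (max_norm=1.0 is arbitrary here)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()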

If your loss diverges to NaN values, the gradients will also be computed as NaNs, and anomaly detection will then flag the error in whichever layer happens to return them.
You would have to make sure your training doesn’t diverge and the loss contains valid values.
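A quick sanity check (just a sketch) would be to guard the backward call and stop as soon as the loss becomes invalid:

import torch

if not torch.isfinite(loss):
    raise RuntimeError(f"non-finite loss detected: {loss.item()}")
loss.backward()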

Hi @ptrblck, thank you for your answer.

That’s the thing: when I inspect the training behaviour, the error is thrown before the loss goes NaN. I dropped into Pdb at the moment the error is raised (with a try: loss.backward() except: pdb.set_trace()), and all model parameters, grads and loss terms are well defined…
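In code, the guard looks roughly like this (a sketch of what I described above):

import pdb

try:
    loss.backward()
except RuntimeError:
    # inspect model parameters, their .grad, and the individual loss terms here
    pdb.set_trace()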

The whole traceback I get is:

/gpfs/users/pellegrainv/.conda/envs/a100/lib/python3.9/site-packages/torch/autograd/__init__.py:173: UserWarning: Error detected in CudnnBatchNormBackward0. Traceback of forward call that caused the error:
  File "/gpfs/workdir/pellegrainv/packages/dmvae/train_dmvae.py", line 265, in <module>
    model = model_pipeline(args)
  File "/gpfs/workdir/pellegrainv/packages/dmvae/train_dmvae.py", line 21, in model_pipeline
    train_losses, train_epoch_losses, test_losses = train_epochs(model, train_loader, test_loader, optimizer, config,
  File "/gpfs/workdir/pellegrainv/packages/dmvae/train_dmvae.py", line 172, in train_epochs
    train_loss, train_epoch_loss = train(model, train_loader, optimizer, epoch, config, quiet)
  File "/gpfs/workdir/pellegrainv/packages/dmvae/train_dmvae.py", line 63, in train
    recons, mu, log_var = model(imagesA, imagesB)
  File "/gpfs/users/pellegrainv/.conda/envs/a100/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/gpfs/workdir/pellegrainv/packages/dmvae/models/dmvae.py", line 25, in forward
    mu_prA, log_var_prA = self.vaeA.encode(inputA)
  File "/gpfs/workdir/pellegrainv/packages/dmvae/models/vanilla_vae.py", line 139, in encode
    result = self.encoder(input)
  File "/gpfs/users/pellegrainv/.conda/envs/a100/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs/users/pellegrainv/.conda/envs/a100/lib/python3.9/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/gpfs/users/pellegrainv/.conda/envs/a100/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs/users/pellegrainv/.conda/envs/a100/lib/python3.9/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/gpfs/users/pellegrainv/.conda/envs/a100/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs/users/pellegrainv/.conda/envs/a100/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 168, in forward
    return F.batch_norm(
  File "/gpfs/users/pellegrainv/.conda/envs/a100/lib/python3.9/site-packages/torch/nn/functional.py", line 2421, in batch_norm
    return torch.batch_norm(
 (Triggered internally at  /opt/conda/conda-bld/pytorch_1646756402876/work/torch/csrc/autograd/python_anomaly_mode.cpp:104.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass

Thank you in advance for your time.

That’s interesting. Could you store the inputs to the batchnorm layer as well as its state and try to check what the output was?

I uploaded the log file (pickle) containing: input, output, weight, bias, running mean, and running var for all 5 batchnorm2d layers in my VAE encoder. It can be downloaded here:
https://drive.google.com/file/d/1vfv26QwMeqnfA6MEme4EnbkBYCJFfebq/view?usp=sharing

I don’t know if that’s convenient for you, but it was too big to post here. On my side, I checked the min and max values of all these parameters/inputs/outputs, and they seemed to be fine.
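For completeness, the state was captured roughly like this (a sketch; the hook layout, the bn_logs.pkl filename and logging every BatchNorm2d of one encoder are just my choices):

import pickle
import torch.nn as nn

logs = {}

def make_forward_hook(name):
    def hook(module, inputs, output):
        logs[name] = {
            "input": inputs[0].detach().cpu(),
            "output": output.detach().cpu(),
            "weight": module.weight.detach().cpu(),
            "bias": module.bias.detach().cpu(),
            "running_mean": module.running_mean.cpu(),
            "running_var": module.running_var.cpu(),
        }
    return hook

# register a forward hook on every BatchNorm2d layer of the encoder
for name, module in model.vaeA.encoder.named_modules():
    if isinstance(module, nn.BatchNorm2d):
        module.register_forward_hook(make_forward_hook(name))

# ... after running the forward/backward pass, dump everything to disk
with open("bn_logs.pkl", "wb") as f:
    pickle.dump(logs, f)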

Thanks.

Edit: I also stored the grad_input for these 5 batchnorm2d layers. At the time the error is thrown, I can observe that the grad_inputs of the last batchnorm2d layer are all -inf.
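The grad_inputs were captured with backward hooks, roughly like this (a sketch; register_full_backward_hook requires a reasonably recent PyTorch version):

import torch.nn as nn

grad_logs = {}

def make_backward_hook(name):
    def hook(module, grad_input, grad_output):
        # grad_input is a tuple; for the last BatchNorm2d layer these entries come out as -inf
        grad_logs[name] = [g.detach().cpu() if g is not None else None for g in grad_input]
    return hook

for name, module in model.vaeA.encoder.named_modules():
    if isinstance(module, nn.BatchNorm2d):
        module.register_full_backward_hook(make_backward_hook(name))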