Warning or error?

I am trying to train a WGAN-GP and am getting strange behavior: at first I got this warning during the gradient penalty (GP) calculation (I am using PyTorch 1.5.0):

[Epoch 3/5000] [Batch 25/163] [D loss: -71.335205] [G loss: 94.295631]
[Epoch 3/5000] [Batch 30/163] [D loss: -64.697197] [G loss: 195.611176]
[Epoch 3/5000] [Batch 35/163] [D loss: -52.765976] [G loss: 182.699905]
[Epoch 3/5000] [Batch 40/163] [D loss: -59.642242] [G loss: 242.636047]
[Epoch 3/5000] [Batch 45/163] [D loss: -65.882965] [G loss: 195.031784]Warning: Error detected in CudnnConvolutionBackward. Traceback of forward call that caused the error:
  File "wgan_birka.py", line 278, in <module>
    gradient_penalty = compute_gradient_penalty(discriminator, real_imgs.data, fake_imgs.data)
  File "wgan_birka.py", line 207, in compute_gradient_penalty
    d_interpolates, _ = D(interpolates)
  File "/home/marat/anaconda3/envs/server/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "wgan_birka.py", line 144, in forward
    x = self.block7(x)
  File "/home/marat/anaconda3/envs/server/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/marat/anaconda3/envs/server/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/marat/anaconda3/envs/server/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/marat/anaconda3/envs/server/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 349, in forward
    return self._conv_forward(input, self.weight)
  File "/home/marat/anaconda3/envs/server/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 346, in _conv_forward
    self.padding, self.dilation, self.groups)
 (print_stack at /opt/conda/conda-bld/pytorch_1587428266983/work/torch/csrc/autograd/python_anomaly_mode.cpp:60)

[Epoch 3/5000] [Batch 50/163] [D loss: -49.171650] [G loss: 111.772812]
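
For context, compute_gradient_penalty itself is not shown above; a standard WGAN-GP penalty with the same call shape as in the traceback (D returning a tuple, autograd.grad(..., only_inputs=True)) looks roughly like this. This is a sketch, not the exact code from wgan_birka.py:

import torch
from torch import autograd

def compute_gradient_penalty(D, real_samples, fake_samples):
    # standard WGAN-GP penalty; matches the calls visible in the traceback,
    # but is not the original implementation
    alpha = torch.rand(real_samples.size(0), 1, 1, 1, device=real_samples.device)
    interpolates = (alpha * real_samples + (1 - alpha) * fake_samples).requires_grad_(True)
    d_interpolates, _ = D(interpolates)  # the discriminator returns a tuple
    gradients = autograd.grad(
        outputs=d_interpolates,
        inputs=interpolates,
        grad_outputs=torch.ones_like(d_interpolates),
        create_graph=True,
        retain_graph=True,
        only_inputs=True,
    )[0]
    gradients = gradients.view(gradients.size(0), -1)
    return ((gradients.norm(2, dim=1) - 1) ** 2).mean()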

And only a huge number of batches later did I get the fatal error that actually stops the run (anomaly detection is on):

[Epoch 4/5000] [Batch 110/163] [D loss: -33.384842] [G loss: -102.455353]
[Epoch 4/5000] [Batch 115/163] [D loss: -15.646807] [G loss: -69.029060]
[Epoch 4/5000] [Batch 120/163] [D loss: -21.404884] [G loss: 4.358668]
Traceback (most recent call last):
  File "wgan_birka.py", line 278, in <module>
    gradient_penalty = compute_gradient_penalty(discriminator, real_imgs.data, fake_imgs.data)
  File "wgan_birka.py", line 217, in compute_gradient_penalty
    only_inputs=True,
  File "/home/marat/anaconda3/envs/server/lib/python3.7/site-packages/torch/autograd/__init__.py", line 158, in grad
    inputs, allow_unused)
RuntimeError: Function 'CudnnConvolutionBackward' returned nan values in its 1th output.

Note that I am using torch.autograd.set_detect_anomaly(True) and num_workers=0, so it should not be some weird parallelism effect.

I saved the batch from the fatal (last) crash, but by then the generator network already contained NaNs, and again, for some reason PyTorch did not crash on the “warning” but only some number of batches later, so I cannot reproduce the exact batch that led to NaNs being stored in the generator weights.

PS: I will try to check all parameters at each weight update and stop execution as soon as a NaN shows up anywhere.
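
A minimal sketch of such a check (the helper, optimizer and model names here are placeholders, not from the original script):

import torch

def assert_finite(module, tag=""):
    # stop training as soon as a parameter or its gradient contains NaN/Inf
    for name, p in module.named_parameters():
        if not torch.isfinite(p).all():
            raise RuntimeError(f"non-finite value in parameter {tag}{name}")
        if p.grad is not None and not torch.isfinite(p.grad).all():
            raise RuntimeError(f"non-finite value in gradient of {tag}{name}")

# called after every update, e.g.
# optimizer_G.step()
# assert_finite(generator, tag="G.")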

UPDATE:

I found the moment when the generator obtained NaNs in its parameters. Surprisingly, the parameter gradients did not contain any NaNs, but the Adam optimizer still set some generator weights to NaN.

Hi,

This is quite unexpected :smiley:
Are you sure that you don’t have a try: except: in Python that eats up the error corresponding to the anomaly mode error?
Also, you can try to force Python to raise an error on warnings, but since you seem to be eating up the error, the warning’s error might be eaten as well…
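
If you want to try that, something along these lines escalates Python warnings to exceptions (a sketch; it may not catch a warning emitted from inside the autograd engine itself):

import warnings

# turn every warning raised through Python's warnings module into an exception
warnings.simplefilter("error")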

I did not add any try/except in my single-file code. I also found that the NaN appears in the generator weights and is set by the Adam optimizer even though the gradients contain no NaNs. Gradient norm clipping, which I was using, did not help either. I found that the only thing I need to reproduce the issue is the optimizer itself; I do not need the original batch or even the generator weights. I will try to make a minimal example that reproduces my issue; it seems that at some point Adam goes crazy.
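
In case it helps with the minimal example, the optimizer state can be snapshotted and reloaded on its own (a sketch; optimizer_G and the file name are placeholders):

import torch

# right before the step that corrupts the weights
torch.save(optimizer_G.state_dict(), "adam_state.pt")

# later, in a standalone reproduction script
optimizer_G.load_state_dict(torch.load("adam_state.pt"))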

That would be great if you have a small code sample that reproduces the issue!

Recently I found that in my case Adam already has a negative value in state['exp_avg_sq']. This explains why the loaded optimizer state alone is enough to reproduce the issue, because

exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2) - which does not bring it back to a non-negative value -

and

denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(group['eps'])

which immediately produces a NaN value (square root of a negative number).

See adam.py file
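
To see why a negative entry breaks that line, here is a tiny illustration (not taken from adam.py itself, just the same arithmetic):

import math
import torch

exp_avg_sq = torch.tensor([1e-4, -1e-6])   # one corrupted (negative) entry
bias_correction2 = 0.999
eps = 1e-8

denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(eps)
print(denom)  # the second entry is NaN: square root of a negative number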

Why Adam accepted invalid values into its state is a good question… So now I need to track its state to find the moment when it gets a negative value in state['exp_avg_sq'].
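
A minimal sketch of such a tracker, run after every optimizer step (the function name is mine):

import torch

def check_adam_state(optimizer, step_idx):
    # abort as soon as any second-moment estimate goes negative or non-finite
    for group in optimizer.param_groups:
        for p in group["params"]:
            state = optimizer.state.get(p, {})
            v = state.get("exp_avg_sq")
            if v is None:
                continue
            if (v < 0).any() or not torch.isfinite(v).all():
                raise RuntimeError(
                    f"invalid exp_avg_sq at step {step_idx}: min={v.min().item()}"
                )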