I am trying to train a WGAN-GP and I am seeing strange behavior: at first I get this warning during the gradient penalty calculation (I am using PyTorch 1.5.0):
[Epoch 3/5000] [Batch 25/163] [D loss: -71.335205] [G loss: 94.295631]
[Epoch 3/5000] [Batch 30/163] [D loss: -64.697197] [G loss: 195.611176]
[Epoch 3/5000] [Batch 35/163] [D loss: -52.765976] [G loss: 182.699905]
[Epoch 3/5000] [Batch 40/163] [D loss: -59.642242] [G loss: 242.636047]
[Epoch 3/5000] [Batch 45/163] [D loss: -65.882965] [G loss: 195.031784]
Warning: Error detected in CudnnConvolutionBackward. Traceback of forward call that caused the error:
File "wgan_birka.py", line 278, in <module>
gradient_penalty = compute_gradient_penalty(discriminator, real_imgs.data, fake_imgs.data)
File "wgan_birka.py", line 207, in compute_gradient_penalty
d_interpolates, _ = D(interpolates)
File "/home/marat/anaconda3/envs/server/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "wgan_birka.py", line 144, in forward
x = self.block7(x)
File "/home/marat/anaconda3/envs/server/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/marat/anaconda3/envs/server/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/home/marat/anaconda3/envs/server/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/marat/anaconda3/envs/server/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 349, in forward
return self._conv_forward(input, self.weight)
File "/home/marat/anaconda3/envs/server/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 346, in _conv_forward
self.padding, self.dilation, self.groups)
(print_stack at /opt/conda/conda-bld/pytorch_1587428266983/work/torch/csrc/autograd/python_anomaly_mode.cpp:60)
[Epoch 3/5000] [Batch 50/163] [D loss: -49.171650] [G loss: 111.772812]
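For context, compute_gradient_penalty (where the warning points) is the standard WGAN-GP penalty. Below is only a minimal sketch of that kind of function: the d_interpolates, _ = D(interpolates) call and only_inputs=True match my traceback, everything else is illustrative.

import torch
from torch import autograd

def compute_gradient_penalty(D, real_samples, fake_samples):
    # Random per-sample interpolation coefficient (NCHW image batches assumed)
    alpha = torch.rand(real_samples.size(0), 1, 1, 1, device=real_samples.device)
    interpolates = (alpha * real_samples + (1 - alpha) * fake_samples).requires_grad_(True)
    d_interpolates, _ = D(interpolates)  # D returns (validity, features), as in the traceback
    gradients = autograd.grad(
        outputs=d_interpolates,
        inputs=interpolates,
        grad_outputs=torch.ones_like(d_interpolates),
        create_graph=True,
        retain_graph=True,
        only_inputs=True,
    )[0]
    gradients = gradients.view(gradients.size(0), -1)
    return ((gradients.norm(2, dim=1) - 1) ** 2).mean()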
Only a large number of batches later do I get the fatal error that actually stops the run (anomaly detection is on):
[Epoch 4/5000] [Batch 110/163] [D loss: -33.384842] [G loss: -102.455353]
[Epoch 4/5000] [Batch 115/163] [D loss: -15.646807] [G loss: -69.029060]
[Epoch 4/5000] [Batch 120/163] [D loss: -21.404884] [G loss: 4.358668]
Traceback (most recent call last):
File "wgan_birka.py", line 278, in <module>
gradient_penalty = compute_gradient_penalty(discriminator, real_imgs.data, fake_imgs.data)
File "wgan_birka.py", line 217, in compute_gradient_penalty
only_inputs=True,
File "/home/marat/anaconda3/envs/server/lib/python3.7/site-packages/torch/autograd/__init__.py", line 158, in grad
inputs, allow_unused)
RuntimeError: Function 'CudnnConvolutionBackward' returned nan values in its 1th output.
Note that I am using torch.autograd.set_detect_anomaly(True) and num_workers=0, so this should not be a weird parallelism effect.
I saved the batch at the fatal (last) crash, but by then the generator network already contained NaNs. Again, for some reason PyTorch did not crash at the "warning" but only some number of batches later, so I cannot reproduce the exact batch that caused NaNs to be stored in the generator's weights.
PS: I will try to check all parameters at each weight update and stop execution as soon as a NaN appears anywhere.
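Roughly like this (sketch only; assert_finite is a helper I am adding, and generator / optimizer_G are placeholder names):

def assert_finite(module, tag):
    # Raise as soon as any parameter or gradient of `module` contains NaN/Inf
    for name, p in module.named_parameters():
        if not torch.isfinite(p).all():
            raise RuntimeError("non-finite value in %s parameter %s" % (tag, name))
        if p.grad is not None and not torch.isfinite(p.grad).all():
            raise RuntimeError("non-finite value in %s gradient %s" % (tag, name))

# in the training loop, right after each update:
# optimizer_G.step()
# assert_finite(generator, "generator")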
UPDATE:
I found the moment when the generator obtained NaNs in its parameters. Surprisingly, the parameter gradients did not contain any NaNs, yet the Adam optimizer still set some generator weights to NaN.
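One possible explanation I will check: if any earlier gradient was Inf/NaN, Adam's running moments (exp_avg / exp_avg_sq) become non-finite and stay that way, so a later step can write NaN into the weights even though the current gradient is clean. A sketch of how one can inspect the optimizer state (again, generator / optimizer_G are placeholder names):

for p in generator.parameters():
    state = optimizer_G.state.get(p, {})
    for key in ("exp_avg", "exp_avg_sq"):
        if key in state and not torch.isfinite(state[key]).all():
            print("non-finite", key, "in Adam state for parameter of shape", tuple(p.shape))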