I am running into a strange error when using autograd:
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/glowtts/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/user/tts/glow-tts/train.py", line 92, in train_and_eval
    train(rank, epoch, hps, generator, optimizer_g, train_loader, None, None)
  File "/home/user/tts/glow-tts/train.py", line 119, in train
    loss_g.backward()
  File "/home/user/anaconda3/envs/glowtts/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/user/anaconda3/envs/glowtts/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7f09d7de3840> returned NULL without setting an error
It is from this repository.
I am using CUDA 11.2 with a self-built PyTorch. I have also tried the latest NGC PyTorch image (pytorch:20.12-py3), and it gives the same error and the same strange behavior.
After a few hours of investigation, I located the module that is failing: it is the weight in InvConvNear.
If I set weight = torch.ones_like(weight).cuda() on line 224, the exception is not thrown (but obviously the model then learns nothing).
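To narrow it down further, this is roughly how the backward pass could be probed on a stand-alone weight, outside the training loop. It is only a sketch: the 4x4 shape, the QR-based initialisation, the use of torch.logdet as the loss, and the helper name probe_backward are all my assumptions for illustration, not code copied from the repository.

```python
import torch


def probe_backward(device: str = "cpu") -> torch.Tensor:
    """Run forward/backward on a stand-alone square weight and return its gradient."""
    torch.manual_seed(0)
    # Hypothetical stand-in for the InvConvNear weight: a random orthogonal 4x4 matrix.
    q, _ = torch.linalg.qr(torch.randn(4, 4))
    # Flip one column if needed so the determinant is positive and logdet is finite.
    if torch.det(q) < 0:
        q[:, 0] = -q[:, 0]
    weight = q.to(device).requires_grad_(True)
    # Anomaly detection makes autograd report which forward op produced a failing backward.
    with torch.autograd.detect_anomaly():
        loss = torch.logdet(weight)  # assumed loss term, for illustration only
        loss.backward()
    return weight.grad


grad_cpu = probe_backward("cpu")
print(grad_cpu)
if torch.cuda.is_available():
    print(probe_backward("cuda"))
```

If the CPU run passes and the CUDA run fails, that would suggest the problem is in the GPU code path for this operation rather than in the model code, but I have not confirmed this.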
Do you have any idea how to debug this error or solve this problem?