Hi,
I am running into a strange error when using autograd:
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/glowtts/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/user/tts/glow-tts/train.py", line 92, in train_and_eval
    train(rank, epoch, hps, generator, optimizer_g, train_loader, None, None)
  File "/home/user/tts/glow-tts/train.py", line 119, in train
    loss_g.backward()
  File "/home/user/anaconda3/envs/glowtts/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/user/anaconda3/envs/glowtts/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7f09d7de3840> returned NULL without setting an error
I am using CUDA 11.2 with PyTorch built from source (I have also tried the latest NGC PyTorch image, pytorch:20.12-py3, and it shows the same error and the same strange behavior).
After a few hours of investigation, I located the module that is failing: the weight in InvConvNear.
If I set weight = torch.ones_like(weight).cuda() on line 224, the exception is not thrown (but obviously nothing is learned).
Do you have any idea how to debug this error or solve this problem?
After I set requires_grad to False on the weight, it works (but obviously it learns incorrectly). Is there any way to get a normal error log from PyTorch? I really don’t know how to interpret “returned NULL without setting an error”.
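For what it’s worth, this is a minimal sketch of how I try to get a more useful trace with autograd’s anomaly detection (set_detect_anomaly / detect_anomaly are the standard PyTorch APIs; compute_loss here is just a placeholder for my actual loss computation):

import torch

# Enable anomaly detection globally: backward then reports the forward-pass
# stack trace of the operation whose gradient computation fails.
torch.autograd.set_detect_anomaly(True)

# Alternatively, wrap only the suspicious part of the training step:
with torch.autograd.detect_anomaly():
    loss_g = compute_loss(generator, batch)  # placeholder for the real loss
    loss_g.backward()

This slows training down considerably, so it is only meant for debugging runs, but it did not give me anything more informative than the SystemError above.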
I got this error when trying to run https://github.com/petuum/adaptdl/tree/master/examples/pytorch-cifar. It seems to happen randomly. I tried different versions of PyTorch (1.6, 1.7, 1.7.1), but the error is still there. In the end I worked around it by setting CUDA_VISIBLE_DEVICES="0" to disable DDP, as in the sketch below.
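In case anyone prefers to do this from inside the script rather than the shell, a minimal sketch (the variable has to be set before torch initializes CUDA, so before the first import of torch in practice):

import os

# Make only GPU 0 visible so the launcher falls back to a single process.
# Must be set before torch touches CUDA for the first time.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # prints 1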
I have tried Python 3.6, 3.7, and 3.8.
Has anyone solved this problem? I have tested many PyTorch builds from source, but nothing changes. I think this comes from CUDA 11.0, because my source code runs successfully on a V100 with CUDA 10.2.
That’s good to hear, as more recent releases ship with the latest bug fixes.
Note that apex.amp is deprecated (in case you are using this utility from apex) and you should use the native implementation via torch.cuda.amp.
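A minimal sketch of the native torch.cuda.amp usage, assuming a standard training loop (model, optimizer, criterion, and loader are placeholders; autocast and GradScaler are the actual API):

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # scales the loss to avoid fp16 gradient underflow

for data, target in loader:            # placeholder data loop
    optimizer.zero_grad()
    with autocast():                   # ops run in fp16/fp32 as appropriate
        output = model(data)
        loss = criterion(output, target)
    scaler.scale(loss).backward()      # backward on the scaled loss
    scaler.step(optimizer)             # unscales grads, then optimizer.step()
    scaler.update()                    # adjusts the scale factor for next step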
A general question: recently I have been using apex.amp O2 to save memory with a large batch size, getting a speed-up without an accuracy drop. However, the native torch.cuda.amp does not save memory in the same way. Is there an alternative to apex.amp O2 now that apex.amp is deprecated?