I am running into a strange error when using autograd:
Traceback (most recent call last):
File "/home/user/anaconda3/envs/glowtts/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
File "/home/user/tts/glow-tts/train.py", line 92, in train_and_eval
train(rank, epoch, hps, generator, optimizer_g, train_loader, None, None)
File "/home/user/tts/glow-tts/train.py", line 119, in train
File "/home/user/anaconda3/envs/glowtts/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/user/anaconda3/envs/glowtts/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7f09d7de3840> returned NULL without setting an error
It is from this repository.
I am using CUDA 11.2 with a self-built PyTorch (I have also tried the latest NGC PyTorch image, pytorch:20.12-py3, and it shows the same error and the same strange behavior).
After a few hours of investigation, I located the module that is failing: it is the weights in InvConvNear.
If I set
weight = torch.ones_like(weight).cuda() on line 224, it does not throw an exception (but obviously it is not learning anything).
Do you have any idea how to debug this error or solve this problem?
After I set requires_grad to False on weight, it works (but obviously it learns incorrectly). Is there any way to get a normal error log from PyTorch? I really don’t know how to interpret “returned NULL without setting an error”.
This is the first time I see this error message :o This hasn’t happened in a while! haha
Can you re-run the program after setting
TORCH_SHOW_CPP_STACKTRACES=1 please? And share the C++ stack traces that it prints when it errors out.
Also do you have an easy way for us to reproduce this (like a simple colab notebook for example)?
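For anyone following along, here is a minimal sketch of one way to apply the TORCH_SHOW_CPP_STACKTRACES suggestion from a Python launcher. The helper names and the "train.py" script name are placeholders for illustration only; the environment variable itself is the one named in the reply above.

```python
import os
import subprocess
import sys


def env_with_cpp_stacktraces(base_env=None):
    """Return a copy of the environment with C++ stack traces enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["TORCH_SHOW_CPP_STACKTRACES"] = "1"
    return env


def run_training(script="train.py", extra_args=()):
    # `script` and `extra_args` are placeholders for the real launch command.
    cmd = [sys.executable, script, *extra_args]
    # The child process inherits the variable, so it is set before torch loads.
    return subprocess.run(cmd, env=env_with_cpp_stacktraces(), check=False)
```

From a shell, prefixing the usual command with TORCH_SHOW_CPP_STACKTRACES=1 should have the same effect.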
Also, feel free to ping me here in case you think the issue might be caused by CUDA 11.2 (or any other GPU lib).
I got this error when trying to run https://github.com/petuum/adaptdl/tree/master/examples/pytorch-cifar. The error seems to occur randomly. I tried different versions of PyTorch (1.6, 1.7, 1.7.1), but the error persisted. In the end I got around it by setting
CUDA_VISIBLE_DEVICES="0" to disable DDP.
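A rough sketch of the same workaround done from inside the script rather than the shell. The function name is made up for illustration; the assumption is that the variable has to be in place before CUDA is initialized, so the safest spot is before the first import of torch.

```python
import os


def restrict_visible_gpus(devices="0"):
    """Expose only the listed GPU indices to CUDA in this process."""
    os.environ["CUDA_VISIBLE_DEVICES"] = devices
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Call this before the first `import torch`, so the setting is in place
# before CUDA is initialized.
restrict_visible_gpus("0")
```

With only one GPU visible, a script that starts one worker per visible device would end up with a single process, which is presumably why this sidesteps the DDP code path here.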
I have the same error, in my case:
- CUDA 11.0 on A100
- Conda environment with
python -m pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 torchaudio==0.7.0
- Apex using the latest from the repo
- I have tried Python 3.6, 3.7, and 3.8
=> Has anyone solved this problem? I tested several PyTorch builds from source, but nothing changed. I think this comes from CUDA 11.0, because I ran the same source code successfully on a V100 with CUDA 10.2.
I fixed this by installing PyTorch from source and using a suitable Python version (in my case, Python 3.6.10).
That’s good to hear, as more recent releases ship with the latest bug fixes.
apex.amp is deprecated (in case you are using this utility from
apex) and you should use the native implementation via
A general question: I have recently been using apex.amp O2 to save memory, which allows a larger batch size for speed-up without an accuracy drop. However, the native torch.cuda.amp does not save memory. Is there any alternative to apex.amp O2 now that apex.amp is deprecated?
No, not yet as we are investigating how the legacy “O2-style” amp could work in the native implementation.
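For reference, a minimal sketch of a training step with the native torch.cuda.amp discussed above. The tiny linear model and random data are placeholders, not anything from this thread, and this is not a replacement for apex O2. Mixed precision is only enabled when a GPU is present; on CPU both the autocast context and the scaler are turned off so the step still runs.

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # autocast and loss scaling only make sense on GPU here

# Placeholder model and data, for illustration only.
model = nn.Linear(8, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

inputs = torch.randn(4, 8, device=device)
targets = torch.randn(4, 2, device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=use_amp):
    # The forward pass runs in mixed precision when enabled.
    loss = nn.functional.mse_loss(model(inputs), targets)

scaler.scale(loss).backward()  # scale the loss to limit fp16 gradient underflow
scaler.step(optimizer)         # unscale gradients, then run optimizer.step()
scaler.update()                # adjust the scale factor for the next iteration
```

Newer PyTorch releases may print a deprecation warning for the torch.cuda.amp names and point to torch.amp instead; the structure of the step stays the same.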
I also ran into this issue recently, but in the end it seems to have been an OOM (out-of-memory) issue. After reducing the batch size, it works normally.
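If the cause really is memory, one way to test that is to retry with a smaller batch size automatically. Below is a rough sketch under two assumptions: `step` is a hypothetical callable that runs one training step for a given batch size, and the out-of-memory condition surfaces as a RuntimeError whose message contains "out of memory" (the SystemError shown at the top of this thread would not be caught by this).

```python
def run_with_batch_backoff(step, batch_size, min_batch_size=1):
    """Call step(batch_size), halving batch_size after each out-of-memory error.

    Returns (batch_size_that_worked, result_of_step).
    """
    while True:
        try:
            return batch_size, step(batch_size)
        except RuntimeError as err:
            is_oom = "out of memory" in str(err).lower()
            # Re-raise anything that is not OOM, or OOM at the smallest size.
            if not is_oom or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)
```

If the step only succeeds once the batch size has been halved a few times, that points toward memory pressure rather than a bug in the backward pass.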