Autograd vague error "returned NULL without setting an error"

Tomas_Lysek · February 23, 2021, 9:40pm

Hi,
I am running into a strange error when using autograd:

Traceback (most recent call last):
  File "/home/user/anaconda3/envs/glowtts/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/user/tts/glow-tts/train.py", line 92, in train_and_eval
    train(rank, epoch, hps, generator, optimizer_g, train_loader, None, None)
  File "/home/user/tts/glow-tts/train.py", line 119, in train
    loss_g.backward()
  File "/home/user/anaconda3/envs/glowtts/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/user/anaconda3/envs/glowtts/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7f09d7de3840> returned NULL without setting an error

It is from this repository.

I am using CUDA 11.2 with a self-built PyTorch (I have also tried the latest NGC PyTorch image, pytorch:20.12-py3, and it shows the same error and the same strange behavior).

After a few hours of investigation, I located the module that is failing: it is the weight in InvConvNear.

If I set weight = torch.ones_like(weight).cuda() on line 224, it does not throw the exception (but obviously it’s not learning anything).

Do you have any idea how to debug this error or solve this problem?

Thanks!

Tomas_Lysek · February 24, 2021, 10:43am

After I set requires_grad to False on weight, it works (but obviously it learns the wrong thing). Is there any way to get a normal error log from PyTorch? I really don’t know how to interpret “returned NULL without setting an error”.

albanD · February 24, 2021, 4:32pm

Hi,

This is the first time I see this error message :o This hasn’t happened in a while! haha

Can you re-run the program after setting TORCH_SHOW_CPP_STACKTRACES=1, please? And share the C++ stack traces that it prints when it errors out.
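
One way to apply the suggestion above is to launch the training script in a child process whose environment already contains TORCH_SHOW_CPP_STACKTRACES=1, so the variable is in place before torch is imported. This is only a sketch: the helper names and the default script name "train.py" are assumptions for illustration, not part of the original thread or of PyTorch.

```python
import os
import subprocess
import sys


def env_with_cpp_stacktraces(base_env=None):
    """Return a copy of the environment with C++ stack traces requested."""
    env = dict(os.environ if base_env is None else base_env)
    env["TORCH_SHOW_CPP_STACKTRACES"] = "1"
    return env


def run_training(script="train.py", extra_args=()):
    """Re-run a (hypothetical) training script with the variable set.

    The child process inherits the modified environment, so the setting is
    visible from the moment the interpreter starts.
    """
    cmd = [sys.executable, script, *extra_args]
    return subprocess.run(cmd, env=env_with_cpp_stacktraces(), check=False)
```

From a shell, prefixing the usual command (for example `TORCH_SHOW_CPP_STACKTRACES=1 python train.py`) should have the same effect.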

Also do you have an easy way for us to reproduce this (like a simple colab notebook for example)?

ptrblck · February 26, 2021, 6:07am

Also, feel free to ping me here in case you think the issue might be caused by CUDA 11.2 (or any other GPU lib).

HeyangQin · March 3, 2021, 5:44am

I got this error when trying to run https://github.com/petuum/adaptdl/tree/master/examples/pytorch-cifar. The error seems to occur at random. I tried different versions of PyTorch (1.6, 1.7, 1.7.1), but the error persists. In the end I got around it by setting CUDA_VISIBLE_DEVICES="0" to disable DDP.

NoahDrisort · June 4, 2021, 12:17pm

I have the same error, in my case:

CUDA 11.0 on A100
Conda environment with

python -m pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 torchaudio==0.7.0

Apex, using the latest from the repo
I have tried Python versions 3.6, 3.7, and 3.8
=> Has anyone solved this problem? I have tested several PyTorch builds from source, but nothing changes. I think this comes from CUDA 11.0, because my code ran successfully on a V100 with CUDA 10.2.

NoahDrisort · June 4, 2021, 1:19pm

I fixed this by installing PyTorch from source and using a suitable Python version (in my case, Python 3.6.10).

ptrblck · June 4, 2021, 4:57pm

That’s good to hear, as more recent releases ship with the latest bug fixes.
Note that apex.amp is deprecated (in case you are using this utility from apex) and you should use the native implementation via torch.cuda.amp.
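
For readers moving off apex.amp, a minimal sketch of a training step with the native torch.cuda.amp utilities mentioned above might look like the following. The tiny model, the random data, and the hyperparameters are placeholders chosen for illustration, not taken from the thread. Mixed precision is only enabled when a GPU is present, so the same code falls back to a plain full-precision step on CPU. Newer PyTorch releases may print a deprecation warning that points to the equivalent torch.amp API.

```python
import torch
from torch import nn

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Placeholder model and data, for illustration only.
model = nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs = torch.randn(8, 16, device=device)
targets = torch.randn(8, 4, device=device)

# The scaler multiplies the loss so small fp16 gradients do not underflow.
# With enabled=False it becomes a pass-through.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

losses = []
for _ in range(3):
    optimizer.zero_grad()
    # Ops inside autocast run in reduced precision where that is safe.
    with torch.cuda.amp.autocast(enabled=use_cuda):
        loss = nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales grads, skips the step on inf/NaN
    scaler.update()                # adjusts the scale factor for the next step
    losses.append(loss.item())
```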

amsword · September 9, 2021, 9:36pm

A general question: recently I have been using apex.amp O2 to save memory, which allows a larger batch size and a speed-up without an accuracy drop. However, the native torch.cuda.amp does not save memory. Is there any alternative to apex.amp O2 if apex.amp is deprecated?

ptrblck · September 9, 2021, 10:33pm

No, not yet as we are investigating how the legacy “O2-style” amp could work in the native implementation.

amsword · November 7, 2021, 7:43pm

Recently I also ran into this issue, but in the end it seems to have been an OOM (out-of-memory) issue. After reducing the batch size, training works normally.
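
To check whether memory pressure is the cause, one rough approach is to retry a single training step with progressively smaller batch sizes. The sketch below assumes a hypothetical callable `run_step(batch_size)` that runs one forward/backward pass; both that callable and the helper name are illustrative. A CUDA out-of-memory failure normally surfaces as a RuntimeError whose message contains "out of memory". SystemError is also retried here only because of the report above, which is a guess rather than documented behavior.

```python
def find_working_batch_size(run_step, batch_size, min_batch_size=1):
    """Halve batch_size until run_step(batch_size) stops failing.

    run_step is a hypothetical callable that performs one training step.
    Returns the first batch size that succeeds.
    """
    while batch_size >= min_batch_size:
        try:
            run_step(batch_size)
            return batch_size
        except (RuntimeError, SystemError) as err:
            is_oom = "out of memory" in str(err)
            is_null_error = isinstance(err, SystemError)
            if not (is_oom or is_null_error):
                raise  # unrelated failure: do not hide it
            batch_size //= 2  # try again with half the batch
    raise RuntimeError("no batch size >= min_batch_size succeeded")
```

If the error disappears below some batch size and returns above it, memory is a likely culprit. If it persists even at the smallest size, the cause is probably elsewhere.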