What does this error mean, and how can I debug it when running on the GPU?

RuntimeError                              Traceback (most recent call last)
<ipython-input-12-1c0a7c8d9dd2> in <module>()
     23     loss += Loss(out, target)
     24 
---> 25     loss.backward()
     26     optimizer.step()
     27 

/usr/local/lib/python3.5/site-packages/torch/autograd/variable.py in backward(self, gradient, retain_variables)
    144                     'or with gradient w.r.t. the variable')
    145             gradient = self.data.new().resize_as_(self.data).fill_(1)
--> 146         self._execution_engine.run_backward((self,), (gradient,), retain_variables)
    147 
    148     def register_hook(self, hook):

/usr/local/lib/python3.5/site-packages/torch/nn/_functions/thnn/pooling.py in backward(self, grad_output, _indices_grad)
     61                                                          self.pad, 0,
     62                                                          self.dilation, 1,
---> 63                                                          self.ceil_mode)
     64         grad_input = grad_input.squeeze(2)
     65         return grad_input

RuntimeError: cuda runtime error (59) : device-side assert triggered at /b/wheel/pytorch-src/torch/lib/THCUNN/generic/SpatialDilatedMaxPooling.cu:228

I seem to hit more problems on the GPU than when I just use the CPU, and I do not know why.

RuntimeError: cuda runtime error (59) : device-side assert triggered at /b/wheel/pytorch-src/torch/lib/THCUNN/generic/Threshold.cu:66

Is there a best practice for debugging these problems?

Hi,

Could you run your code with the CUDA_LAUNCH_BLOCKING=1 environment variable set and post the new stack trace, please?
You can do that by running CUDA_LAUNCH_BLOCKING=1 python your_script.py.

After I run CUDA_LAUNCH_BLOCKING=1 python HAN.py:

/home/quoniammm/anaconda3/lib/python3.6/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error
Traceback (most recent call last):
  File "HAN.py", line 264, in <module>
    word_attn.cuda()
  File "/home/quoniammm/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 147, in cuda
    return self._apply(lambda t: t.cuda(device_id))
  File "/home/quoniammm/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 118, in _apply
    module._apply(fn)
  File "/home/quoniammm/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 124, in _apply
    param.data = fn(param.data)
  File "/home/quoniammm/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 147, in <lambda>
    return self._apply(lambda t: t.cuda(device_id))
  File "/home/quoniammm/anaconda3/lib/python3.6/site-packages/torch/_utils.py", line 66, in _cuda
    return new_type(self.size()).copy_(self, async)
  File "/home/quoniammm/anaconda3/lib/python3.6/site-packages/torch/cuda/__init__.py", line 266, in _lazy_new
    _lazy_init()
  File "/home/quoniammm/anaconda3/lib/python3.6/site-packages/torch/cuda/__init__.py", line 85, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70

The result is the same as in the notebook. What does CUDA_LAUNCH_BLOCKING=1 actually do?

I am still confused about the CUDA runtime error.

How can I debug it?

Then I uploaded my program to FloydHub and the error disappeared. Why is that? My NVIDIA GPU is a GTX 1050. I know its memory is small. However, the error doesn't say that it is a memory problem, which is odd.

Hi,

CUDA_LAUNCH_BLOCKING=1 makes CUDA report the error where it actually occurs. CUDA kernels are normally launched asynchronously, so without it an error can surface at a later, unrelated call.
Since the failure is in the CUDA initialization function and does not appear on a different machine, I would guess that your CUDA install is not working properly. You may want to reinstall it and test it with the CUDA samples.
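As a quick complement to the CUDA samples (not a replacement for them), a small check run at the top of a script can force CUDA initialization early, so a broken install or driver shows up before any model code runs. This is only a sketch: the function name `cuda_sanity_check` is made up for illustration, and it assumes a reasonably recent PyTorch that accepts the `device="cuda"` argument.

```python
import torch


def cuda_sanity_check():
    """Return a short status string describing whether CUDA is usable.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    if not torch.cuda.is_available():
        return "CUDA not available (CPU-only build, missing driver, or failed init)"
    try:
        # Allocating a tensor and reading a result back forces initialization
        # and a synchronization, so failures are raised here.
        x = torch.ones(2, device="cuda")
        total = (x + x).sum().item()
    except RuntimeError as exc:
        return "CUDA initialization failed: {}".format(exc)
    return "CUDA OK on {} (test sum = {})".format(torch.cuda.get_device_name(0), total)


status = cuda_sanity_check()
print(status)
```

If this reports a failure on one machine but not on another, that points at the local driver or CUDA setup rather than at the model code.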

Last time I just ignored it and ran the program on AWS. Today I ran a new program and hit the same problem. I just restarted the computer and the problem disappeared. It's so weird. What is the reason for it?
Can you explain it? Thanks.

Not sure what it is, but given that it happens during CUDA initialization and is fixed by a reboot, I would guess your GPU/driver was in a bad state.

Maybe it's completely unrelated to your problem, but this week Ubuntu updated the NVIDIA drivers on my machine (more or less automatically, via the software updater) and PyTorch couldn't use CUDA anymore. After a restart, the error vanished.

I’m also seeing this problem. If I have a screen temperature controller (Redshift) on startup, it won’t work. If I watch Netflix, it won’t work. Somewhat related: if I have any variables in CUDA memory in a Jupyter Notebook, I can’t run Redshift.

Ubuntu 16.04, Python 3.5, Torch 0.2.0, GTX 850M

Is there any way to set CUDA_LAUNCH_BLOCKING=1 from my Jupyter notebook? Or is there any other method?

Probably you could use os.environ['CUDA_LAUNCH_BLOCKING'] = 1 at the beginning of your notebook before importing any other library.
If that doesn’t work, you could export the notebook as a Python script (.py) and run it in your terminal.

Cool!
Something to try out… thanks, sir!

Should os.environ['CUDA_LAUNCH_BLOCKING'] = 1 be os.environ['CUDA_LAUNCH_BLOCKING'] = '1'?

os.environ['CUDA_LAUNCH_BLOCKING'] = 1 will raise the exception:

TypeError: str expected, not int
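To make the point above concrete, here is a minimal sketch of what a first notebook cell could look like. It only uses the standard library; the commented-out `import torch` line marks where the framework import would go, on the assumption that the variable should be set before anything touches CUDA.

```python
import os

# Environment values must be strings; assigning an int is rejected.
rejected = None
try:
    os.environ["CUDA_LAUNCH_BLOCKING"] = 1
except TypeError as exc:
    rejected = exc
    print("int value rejected:", exc)

# The string form works. Do this before the first CUDA call
# (safest: before importing torch at all).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import the framework only after the variable is set
```

If the kernel has already initialized CUDA, restarting it and running this cell first is probably the safer route.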

We must use os.environ['CUDA_LAUNCH_BLOCKING'] = '1' (a string, not an integer) to get the output needed to debug the issue.

CUDA_LAUNCH_BLOCKING=1 python3 file.py

I get: 'CUDA_LAUNCH_BLOCKING' unknown command. How should I run my code with this variable set?

I guess you might be using Windows, whose default terminal might not accept env variables set inline in front of a command like this (I'm just guessing, as I'm not using Windows)?
If so, you could try to set this env variable in the "System Properties" and rerun your script.
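One shell-independent alternative is to launch the script from a small Python wrapper that passes the variable in the child process environment. The sketch below uses only the standard library; the function name `run_python_with_launch_blocking` is invented for this example, and the `-c` one-liner just stands in for a real script so the snippet can run anywhere.

```python
import os
import subprocess
import sys


def run_python_with_launch_blocking(args, **run_kwargs):
    """Run the current Python interpreter with CUDA_LAUNCH_BLOCKING=1 set.

    `args` are the arguments after the interpreter, e.g. ["file.py"].
    Hypothetical helper for illustration.
    """
    env = dict(os.environ)             # copy, so the parent process is untouched
    env["CUDA_LAUNCH_BLOCKING"] = "1"  # must be a string
    return subprocess.run([sys.executable, *args], env=env, **run_kwargs)


# Demo: the child process simply echoes the variable it received.
result = run_python_with_launch_blocking(
    ["-c", "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])"],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())
```

To run an actual script, pass its path instead, for example `run_python_with_launch_blocking(["file.py"])`.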

Yeah, you're right. I also tried the os.environ approach; it only accepts a string, not an int, so I passed "1", but the error message stayed the same…
Anyway, even with the default error message I found my error.

For those who still get the error and use BCELoss: make sure the values are between 0 and 1.
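A cheap way to catch this before it turns into a device-side assert is to validate the tensors right before the loss call. This is a rough sketch, not an official PyTorch utility: `assert_valid_bce_inputs` is a made-up helper, and the tensors are toy values chosen for illustration.

```python
import torch


def assert_valid_bce_inputs(pred, target):
    """Raise ValueError if either tensor has NaNs or values outside [0, 1].

    Hypothetical helper for illustration; assumes non-empty tensors.
    """
    for name, tensor in (("pred", pred), ("target", target)):
        tensor = tensor.detach()
        if torch.isnan(tensor).any():
            raise ValueError("{} contains NaN".format(name))
        lo, hi = tensor.min().item(), tensor.max().item()
        if lo < 0.0 or hi > 1.0:
            raise ValueError(
                "{} has values in [{}, {}], expected [0, 1]".format(name, lo, hi)
            )


logits = torch.tensor([-2.0, 0.5, 3.0])  # raw model outputs, not valid for BCELoss
target = torch.tensor([0.0, 1.0, 1.0])

probs = torch.sigmoid(logits)            # squash into (0, 1) first
assert_valid_bce_inputs(probs, target)
loss = torch.nn.BCELoss()(probs, target)
print(loss.item())
```

Passing `logits` directly to the check raises a ValueError with the offending range. If the model emits raw logits anyway, `torch.nn.BCEWithLogitsLoss` is worth considering, since it applies the sigmoid internally.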

Thx ptrblc for your reply.

I have to reboot my system to make the error go away. Even these commands don't help:

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm

So is there a way to get rid of this error without always rebooting it?

Hi, I am getting the same error while training my model on Google Colab Pro. How do I resolve it? Setting CUDA_LAUNCH_BLOCKING to 1 causes my Colab window to crash: it skips the code cell where I wrote this command and moves on through the program.

Closing the IDE helped me in this case. nvidia-smi showed that a process was still running.
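For anyone who wants to check this from Python rather than by eye, something like the following could list the compute processes that nvidia-smi reports. It is a sketch under assumptions: the helper names are invented, it relies on the `--query-compute-apps` and `--format=csv,noheader` options of nvidia-smi, and it returns None when the tool is missing or fails.

```python
import shutil
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name' CSV lines into a list of (pid, name) tuples."""
    apps = []
    for line in csv_text.splitlines():
        if "," not in line:
            continue  # skip blank or unexpected lines
        pid, name = [field.strip() for field in line.split(",", 1)]
        apps.append((int(pid), name))
    return apps


def list_gpu_processes():
    """Return (pid, name) tuples for GPU compute processes, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,process_name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return parse_compute_apps(result.stdout)


print(list_gpu_processes())
```

A leftover notebook kernel or IDE process in that list is a likely candidate for holding the GPU; closing it, as described above, may free the device without a reboot.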