I am getting a CUDA error even when I comment it out

Geoffrey_Payne · February 6, 2020, 7:45pm

I am getting this error message

  File "main.py", line 185, in <module>
    input = input.to(device)
RuntimeError: CUDA error: device-side assert triggered

I tried commenting out these lines of code:

#if device == torch.device("cuda"):
#input = input.to(device)
output = skipgram_model(input)

But then I get this error message:
RuntimeError: Expected object of device type cuda but got device type cpu for argument #3 'index' in call to _th_index_select

How do I fix this?

albanD · February 6, 2020, 7:48pm

Could you run your code with CUDA_LAUNCH_BLOCKING=1 set on the command line, for example
CUDA_LAUNCH_BLOCKING=1 python your_code.py
and see where the error comes from, please?
The CUDA API is asynchronous by default, so the stack trace above might be wrong.
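
For anyone who launches scripts from an editor rather than a shell, here is a rough Python equivalent of the command-line form. It is only a sketch: the helper name run_with_blocking_launches is made up for illustration, and it assumes the training script can be started as a child Python process.

```python
import os
import subprocess
import sys


def run_with_blocking_launches(python_args):
    """Run a child Python process with CUDA_LAUNCH_BLOCKING=1 set.

    python_args is the argument list handed to the interpreter,
    e.g. ["your_code.py"]. The helper name is hypothetical.
    """
    # Copy the current environment so the variable is set for the child only.
    env = dict(os.environ)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return subprocess.run([sys.executable] + list(python_args), env=env)


# Example call (not executed here):
# result = run_with_blocking_launches(["your_code.py"])
```

The variable has to be in the environment of the process that runs the PyTorch code, which is why the sketch passes env to the child instead of assigning a Python variable.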

Geoffrey_Payne · February 6, 2020, 8:24pm

I put the line
CUDA_LAUNCH_BLOCKING=1
at the beginning of my code, but I do not see any extra information

albanD · February 6, 2020, 8:25pm

It should be set on the command line when you run your code, not in Python.

Geoffrey_Payne · February 6, 2020, 8:29pm

My command line is via PuTTY.
I have just run
sbatch main.sh CUDA_LAUNCH_BLOCKING=1

I am not sure this is what you meant. I normally run my code from VS Code rather than the command line. However, I also run it on a remote server, as it takes time to run.

albanD · February 6, 2020, 8:45pm

You will need to set the environment variable CUDA_LAUNCH_BLOCKING=1 before running any PyTorch code.

So if it is simpler for you, you can set it in the first line of your Python script:
import os; os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

Geoffrey_Payne · February 7, 2020, 9:53am

Thank you for that.
This time I got:

Traceback (most recent call last):
  File "main.py", line 193, in <module>
    loss_window = F.nll_loss(output, Y_d)
  File "/home/xxx.xxx.ac.uk/xxxx/.local/lib/python3.5/site-packages/torch/nn/functional.py", line 1838, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (710) : device-side assert triggered at /pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu:110

albanD · February 7, 2020, 3:59pm

Great. So this means that the problem comes from the nll loss.
The most common reason here is that Y_d contains invalid indices: either negative, or greater than or equal to the number of classes (the size of dimension 1 of output).
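
One way to check this before calling the loss is to scan the targets for out-of-range values. The sketch below works on a plain list of integers so it runs without a GPU. The function name find_invalid_targets and the sample data are made up for illustration, and ignore_index=-100 mirrors the default of F.nll_loss.

```python
def find_invalid_targets(targets, num_classes, ignore_index=-100):
    """Return (position, value) pairs for targets outside [0, num_classes).

    Targets equal to ignore_index are skipped, since nll_loss ignores them.
    """
    bad = []
    for pos, t in enumerate(targets):
        if t == ignore_index:
            continue
        if t < 0 or t >= num_classes:
            bad.append((pos, t))
    return bad


# Made-up targets for a 5-class problem: 5 and -1 are out of range.
print(find_invalid_targets([0, 3, 5, -1, 2], num_classes=5))
# prints [(2, 5), (3, -1)]
```

With the tensors from this thread, and assuming Y_d is a 1-D tensor of class indices and output has shape (batch, classes), the call would look like find_invalid_targets(Y_d.tolist(), output.size(1)). An empty list means the targets are in range and the cause lies elsewhere.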