Weird CUDA illegal memory access error

Hi,

You should run your code with CUDA_LAUNCH_BLOCKING=1 to see where the error comes from.
Because all CUDA calls are asynchronous when you don’t set this option, the Python code will report the error on the next CUDA call after the one that actually failed. This is why trying to use the tensor or print its content raises an error (that uses the GPU), while printing its size or checking whether it is contiguous does not (these are CPU-only operations).
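As a rough illustration (the exact call where the error surfaces can vary with your PyTorch/CUDA version, and the out-of-range index below is just a stand-in for whatever kernel actually fails; you may see a device-side assert instead of an illegal access):

```python
import torch

x = torch.randn(10, device="cuda")
idx = torch.tensor([42], device="cuda")  # deliberately out of range

y = x[idx]                 # kernel is launched asynchronously; may not raise here
print(y.size())            # CPU-only metadata: still fine
print(y.is_contiguous())   # also CPU-only: still fine
print(y)                   # needs the GPU values -> the error is reported here
```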


Thanks for your reply. I actually tried that already; the error message is the same as the first one.

Weird.
Could you provide us with a minimal example that reproduces the problem, please?

Yeah I will try that

I just found that even when I set CUDA_LAUNCH_BLOCKING=1, there is still an error when I try to print the tensor. I was running

CUDA_LAUNCH_BLOCKING=1
python train.py

is this the right way to set this environment variable?

No.
If you run it as two commands, you should use export CUDA_LAUNCH_BLOCKING=1, but that will set it for the whole terminal session.
If you use CUDA_LAUNCH_BLOCKING=1 python train.py (in one command), it will set the environment variable just for that command.
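If you prefer setting it from Python instead of the shell, you can also do it at the very top of the script, before torch initializes CUDA (a minimal sketch):

```python
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch  # import torch (and do all CUDA work) only after setting the variable
```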

Yeah I was wondering if I need to put them in one line, thanks for your reply!

I put them in the same line now, here is the error message:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  function_attributes(): after cudaFuncGetAttributes: an illegal memory access was encountered
./train.sh: line 14:  4111 Aborted                 (core dumped) CUDA_LAUNCH_BLOCKING=1 python train.py

I finally solved this problem.

Although the error message is not very helpful, I guessed the illegal memory access came from an out-of-range index. So I double-checked all my code and finally found that, in certain batches, the ground-truth target could be larger than the number of classes in the softmax. I fixed it and there are no more errors :slight_smile:
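For anyone hitting the same thing, a one-off sanity check along these lines would have caught it (num_classes and train_loader are placeholders for your own setup):

```python
# CPU pass over the labels to catch out-of-range targets before they reach
# the CUDA softmax / NLL kernels and turn into an illegal memory access.
num_classes = 10  # placeholder: the number of classes your final layer outputs

for _, targets in train_loader:  # train_loader: placeholder for your DataLoader
    assert targets.min() >= 0 and targets.max() < num_classes, (
        f"label out of range: min={targets.min().item()}, max={targets.max().item()}"
    )
```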

@albanD thanks for your time anyway


Good that you found the problem!

I have a similar problem.
[image]
Here weight_mask is a tensor.
[image]
After the script had been running for days and this line had been called a couple hundred times, this error occurred. I am not sure whether I will be able to reproduce it again, or how long it will take for it to appear again. Any thoughts on this or possible explanations? Thank you.

Just to share my case, I had a similar error code.
I commented out the line cudnn.benchmark = True and everything works fine now.

The training code works fine with that line commented out, but when I run my validation code, it crashes with the same illegal memory access error (error 77).
Anyways, I will share more if I find something else.
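For reference, the flag in question is just this one line, usually placed once near the top of the script before the model is built (a minimal sketch):

```python
import torch

# benchmark=True lets cuDNN autotune convolution algorithms for fixed input shapes;
# disabling it trades a bit of speed for avoiding the crash described above.
torch.backends.cudnn.benchmark = False
```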


Thanks for your solution.

I’m getting the same illegal memory access error, which was triggered by moving the tensors to the GPU: input[key] = input[key].cuda().

I tried setting cudnn.benchmark = False and running rm -rf ~/.nv following some web searches, but without success. Any suggestions? Thanks a lot!

EDIT: I realized that cudnn.benchmark was set to True on a later line ^^ (I was running someone else’s git repo), and after resetting it to False the error went away!


Have you solved this problem and found the reason for it?

I also ran into the same error when evaluating the model:

RuntimeError: CUDA error: an illegal memory access was encountered

My code looks like

correct = 0
total = 0
for i, (input, target) in tqdm.tqdm(enumerate(data_loader), total=len(dataset)//batch_size):
    target = target.to(device)
    input = input.to(device)
    output = self.model.forward_t(input)
    c = output.argmax(dim=1)
    total += len(target)
    correct += sum(target.cpu().numpy() == c.cpu().numpy())
    acc = float(correct) / total

It is also strange that, if I do not use .cpu().numpy() to convert the data first, the result is incorrect.
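For comparison, here is a sketch of the same loop with the comparison kept on the GPU (model, data_loader and device are placeholders for the objects above; this is not necessarily a fix, just a more idiomatic way to write it):

```python
import torch

model.eval()  # model, data_loader and device stand in for the poster's objects
correct = 0
total = 0
with torch.no_grad():  # no autograd graph is needed for evaluation
    for inputs, targets in data_loader:
        inputs = inputs.to(device)
        targets = targets.to(device)
        preds = model(inputs).argmax(dim=1)
        correct += (preds == targets).sum().item()  # compare on the same device
        total += targets.size(0)

accuracy = correct / total
```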


Hi,
I am facing the same issue. Setting cudnn.benchmark=False did not help (it was set to False from the beginning). My code crashes after the second call to some function. (I use CUDA_LAUNCH_BLOCKING=1 to find out where the error occurred.) Any pointers to the cause and how to fix it? Thanks!

File "../libs/bn.py", line 109, in forward
    self.training, self.momentum, self.eps, self.activation, self.slope)
  File "../libs/functions.py", line 99, in forward
    running_mean.mul_((1 - ctx.momentum)).add_(ctx.momentum * mean)
RuntimeError: CUDA error: an illegal memory access was encountered

When trying to print the value of the tensor running_mean (during the second call), it raises the following error:


print(running_mean)
  File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/tensor.py", line 66, in __repr__
    return torch._tensor_str._str(self)
  File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/_tensor_str.py", line 277, in _str
    tensor_str = _tensor_str(self, indent)
  File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/_tensor_str.py", line 195, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/_tensor_str.py", line 84, in __init__
    nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
  File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/functional.py", line 271, in isfinite
    return (tensor == tensor) & (tensor.abs() != inf)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/generated/../THCTensorMathCompareT.cuh:69

-> running_mean seems to have inf values!
It seems to be an issue related to the machine where the code is running (more specifically, CUDA-related; things run fine on CPU).
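A quick check along these lines can confirm whether a buffer like running_mean really holds non-finite values (a sketch; running_mean stands in for whatever tensor you suspect):

```python
import torch

def report_nonfinite(name, t):
    # Detach and copy to the CPU first, so the check itself does not launch
    # extra CUDA kernels (printing a CUDA tensor does, as the traceback shows).
    t_cpu = t.detach().cpu()
    if not torch.isfinite(t_cpu).all():
        print(f"{name} contains inf/nan values")

# running_mean is a placeholder for the suspect buffer from libs/bn.py
report_nonfinite("running_mean", running_mean)
```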

Fix and possible explanation.

How can I do that? I have the same problem.

You just need to set the environment variable before launching your script. The simplest is CUDA_LAUNCH_BLOCKING=1 python your_script.py.

Another solution is using torch.cuda.set_device(1). That should also work.
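A minimal sketch of both options (the device index 1 is just an example and assumes a machine with at least two GPUs; use whichever GPU you actually want):

```python
import torch

# Option 1: make cuda:1 the default device for subsequent .cuda() calls
torch.cuda.set_device(1)

# Option 2 (usually clearer): pass the device explicitly everywhere
device = torch.device("cuda:1")
x = torch.randn(4, 8, device=device)
```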

I was playing around with Wav2Lip.
In my case:
imgs = torch.from_numpy(imgs).float().to(device)
generated the same error…

And I needed to explicitly set:

torch.backends.cudnn.benchmark = False # was True

and now it works like magic!
What is the reason behind that?!
Cordially,
Constantine.