RuntimeError: CUDA error: an illegal memory access was encountered autograd

NoobCoder · June 12, 2020, 11:54pm

Hello,

I have written a custom autograd class, which means I have implemented backward as well. When I run the code, I get random CUDA errors.

RuntimeError: CUDA error: an illegal memory access was encountered

This is one of the four errors I receive. These are the other three:

nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCReduceAll.cuh:327

RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278

File "/projects/ovcare/classification/ywang/myenv/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

Based on previous posts, some suggested running with CUDA_LAUNCH_BLOCKING=1 python your_script.py.

I use multiple GPUs (DataParallel), but with this setting my code is so slow that after one hour it had not finished a single batch of the first epoch. When I try just one GPU, it works, and after 4 epochs it gives me this error:

num_x = x.view(1, H, W) - x_points.view(batch_size, 1, 1)
RuntimeError: CUDA error: an illegal memory access was encountered

It says this line has a problem. This is my code:

coords = torch.tensor([[h, w] for h in range(H) for w in range(W)], device=device)
x = coords[:, 0]; y = coords[:, 1]
x, y = x.reshape(H, W), y.reshape(H, W)
num_x = x.view(1, H, W) - x_points.view(batch_size, 1, 1)
num_y = y.view(1, H, W) - y_points.view(batch_size, 1, 1)

This is in the @staticmethod def backward(ctx, grad_output) of the model I defined.
I cannot understand where the problem is. I am defining the x and y tensors myself, so where is the illegal access?

Another strange thing is that I print x.view(1, H, W) and x_points.view(batch_size, 1, 1) separately before this line, and both print correctly even on the iteration that crashes, but when I try to print them again after calculating num_x, I get this error.

I have read others’ posts, and they say they had a problem with indexing and …, but in my case, I am calculating the output gradient, and it works fine and the shape of the gradient is correct. I can print the shapes of these tensors without any error, but when I try to print their values, I get the error.
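A note on why the shapes print but the values do not: a tensor's shape is metadata kept on the host, while printing values needs a device-to-host copy, which waits for all queued kernels. CUDA kernels are launched asynchronously, so a failure in an earlier kernel can be reported at whichever later line happens to synchronize. The following is a minimal sketch, not the original backward: `sync_checkpoint` is a hypothetical helper, the sizes and the `x_points` values are placeholders, and the grid is built with `torch.arange` instead of the list comprehension above.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def sync_checkpoint():
    """Wait for all queued CUDA work so a pending kernel error is raised here."""
    if device.type == "cuda":
        torch.cuda.synchronize()


# Placeholder sizes; the real values come from the surrounding backward().
H, W, batch_size = 4, 5, 2

# Same grid as the list comprehension: x[h, w] == h and y[h, w] == w.
x = torch.arange(H, device=device).view(H, 1).expand(H, W)
y = torch.arange(W, device=device).view(1, W).expand(H, W)

# Placeholder points (assumption); the real x_points come from the forward pass.
x_points = torch.zeros(batch_size, dtype=x.dtype, device=device)

sync_checkpoint()  # an error raised here was caused by earlier work
num_x = x.unsqueeze(0) - x_points.view(batch_size, 1, 1)
sync_checkpoint()  # an error raised here was caused by the subtraction itself
```

If the first checkpoint is the one that raises, the subtraction is only where the error became visible, and the faulty kernel was launched somewhere earlier.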

ptrblck · June 13, 2020, 10:09am

Could you run your code on the CPU, as this would give you a potentially better error message (if not already done)?
If the code is working fine on the CPU, could you install the latest nightly binary and rerun the single-GPU code with CUDA_LAUNCH_BLOCKING=1?
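For reference, CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the Python traceback points at the operation that actually failed. It has to be set before the process initializes CUDA. One way to do that from Python is to start the script as a child process with the variable in its environment. The helper name `run_with_blocking_launches` is made up for this sketch, and the `-c` one-liner only stands in for `your_script.py`.

```python
import os
import subprocess
import sys


def run_with_blocking_launches(argv):
    """Run a Python command line in a child process with CUDA_LAUNCH_BLOCKING=1."""
    env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
    return subprocess.run(
        [sys.executable, *argv], env=env, capture_output=True, text=True
    )


# Stand-in for `your_script.py`: the child only reports the variable it sees.
result = run_with_blocking_launches(
    ["-c", "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])"]
)
print(result.stdout.strip())
```

Replacing the `-c` arguments with `["your_script.py"]` would run the real training script the same way.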

NoobCoder · June 13, 2020, 10:23am

Yes, the code is running on CPUs without any errors.

What exactly do you mean by the latest nightly binary? I am using a server and have to ask the admins to install it, so I would appreciate it if you could tell me exactly what I should ask them for.

ptrblck · June 13, 2020, 10:25am

In the get started section, you should select your OS, the CUDA version you would like to use and “Preview (Nightly)” to get the nightly binary.
Alternatively you could also build PyTorch from source using the master branch, if that’s already done by your system admins.

NoobCoder · June 15, 2020, 7:28pm

The problem is now fixed by uninstalling PyTorch and torchvision, and installing them with:

pip install torch==1.4.0+cu100 torchvision==0.5.0+cu100 -f https://download.pytorch.org/whl/torch_stable.html

Before that I was using the cu92 build, and with it I got multiple random CUDA memory errors.

ptrblck · June 16, 2020, 7:38am

This might not have fixed the underlying issue, as you have installed an older PyTorch and torchvision version.
Let me know if you get the chance to install the nightly binary and are still facing this issue.

douglasrizzo · August 19, 2020, 5:28pm

I had a similar issue in a specific part of my model. I do not declare any custom class that needs its own autograd.

I was having trouble finding the source of the error, so I just started blindly printing my tensors until a print statement of a certain variable (that was printed once before and was not changed since then) gave me this error.

def forward(self, x: dict, node_type: torch.Tensor):
      print(node_type)
      X = torch.zeros(node_type.size(0), self.out_features, device=self.device)
      # X = torch.empty(node_type.size(0), self.out_features, device=self.device)
      print(node_type)
      print('X', X, X.size())

      for nt in node_type.unique().tolist():
         print(nt)
         print(node_type)
         node_mask = (node_type == nt)
         print(node_mask)
         print(node_type)
         enc = self.apply_net(x[nt], nt)

         print('X', X) # this is where I got the error
         print('enc', enc)

         X[node_mask] = enc

      print(node_type)
      return X

This is the output:

tensor([0, 0, 0], device='cuda:0')
tensor([0, 0, 0], device='cuda:0')
tensor([0, 0, 0], device='cuda:0')
tensor([0, 0, 0], device='cuda:0')
X tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0') torch.Size([3, 128])
0
tensor([0, 0, 0], device='cuda:0')
tensor([True, True, True], device='cuda:0')
tensor([0, 0, 0], device='cuda:0')
X Something happened, saving checkpoint...
Training step: 1
	Saving checkpoint...
THCudaCheck FAIL file=/pytorch/torch/csrc/generic/serialization.cpp line=31 error=700 : an illegal memory access was encountered
Traceback (most recent call last):
  File "/home/user/Code/my_agent/my_agent/training/run.py", line 514, in <module>
    runner.run()
  File "/home/user/Code/my_agent/my_agent/training/run.py", line 238, in run
    q_vals = self.model.policy_net(current_state)
  File "/home/user/.anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/Code/my_agent/my_agent/nn/nets.py", line 156, in forward
    x = self.encoding_layer(input_by_class, batch_node_types)
  File "/home/user/.anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/Code/my_agent/my_agent/nn/modules/encoding.py", line 101, in forward
    print('X', X)
  File "/home/user/.anaconda3/lib/python3.8/site-packages/torch/tensor.py", line 153, in __repr__
    return torch._tensor_str._str(self)
  File "/home/user/.anaconda3/lib/python3.8/site-packages/torch/_tensor_str.py", line 371, in _str
    return _str_intern(self)
  File "/home/user/.anaconda3/lib/python3.8/site-packages/torch/_tensor_str.py", line 351, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/home/user/.anaconda3/lib/python3.8/site-packages/torch/_tensor_str.py", line 241, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/home/user/.anaconda3/lib/python3.8/site-packages/torch/_tensor_str.py", line 89, in __init__
    nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
RuntimeError: copy_if failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

As you can see, I printed variable X before and did not get a CUDA error, but then printed it once again and got an error. My problem started occurring when I updated PyTorch to 1.6.0. Downgrading back to 1.5.0 solved it for me.

My System is Manjaro Linux, Python 3.8.3 on Anaconda, CUDA 10.1, NVIDIA drivers 435.21, GeForce GTX 1070.

douglasrizzo · August 19, 2020, 5:51pm

I tried installing the nightly build, but my project has other dependencies that only work with stable PyTorch, so I could not test my code on nightly.

douglasrizzo · August 19, 2020, 6:34pm

OK, thanks to the downgrade, I found out the problem was that I was applying an nn.Linear layer stored on my GPU to an input tensor that was stored on my CPU. My guess is that the operation went through without errors and corrupted the GPU memory, so when I later accessed an unrelated variable stored on the GPU, the corrupted memory ended up throwing this error. PyTorch 1.5.0 actually gave me a descriptive error and I was able to fix this.

Then, I went back to PyTorch 1.6.0 and the cudaErrorIllegalAddress error was gone.
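For anyone hitting the same thing on an affected version, an explicit device check before calling the layer turns the silent mismatch into a readable error. This is only a sketch: `check_same_device` is a hypothetical helper, not a PyTorch API, and the layer and input sizes are made up.

```python
import torch
from torch import nn


def check_same_device(module, tensor):
    """Raise a readable error if `tensor` is not on the device of `module`'s parameters."""
    param_device = next(module.parameters()).device
    if tensor.device != param_device:
        raise RuntimeError(
            f"input is on {tensor.device}, but the module parameters are on {param_device}"
        )
    return tensor


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
layer = nn.Linear(4, 2).to(device)

# Create the input directly on the layer's device instead of leaving it on the CPU.
inp = torch.ones(3, 4, device=device)
out = layer(check_same_device(layer, inp))
```

The helper assumes the module has at least one parameter and that all of its parameters live on one device, which holds for a single nn.Linear.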

Montey · August 25, 2020, 10:57pm

Wow, turns out I had the exact same problem. Strange that it just silently corrupts the memory without any error or warning when you do this by mistake. Thanks for posting the solution!

ptrblck · August 26, 2020, 8:34am

Could you post a code snippet, which would create the proper error message in 1.5 and the illegal memory access in 1.6, please?
If the illegal memory access is still seen in the nightly binary, we should fix it as soon as possible.

fiorenza2 · November 18, 2020, 6:24pm

Hey, I’m running PyTorch 1.6 and I may have a pretty simple related repro for this, and the error is kind of weird. Basically if you run this

torch.matmul(torch.ones(3,1), torch.ones(1,3).to('cuda'))

It will work but return some garbage the first time:

tensor([[0.0000e+00, 0.0000e+00, 6.9633e-36],
        [0.0000e+00, 6.9633e-36, 0.0000e+00],
        [6.9634e-36, 0.0000e+00, 0.0000e+00]])

The second time, it raises RuntimeError: CUDA error: an illegal memory access was encountered.

Needless to say, sending the second ones tensor to the appropriate device fixes this. Running a similar piece of code was also returning an “out of memory” error, but I can’t repro this as easily.

Hope this helps.

ptrblck · November 19, 2020, 8:12am

Could you update to PyTorch 1.7.0 or the nightly release, as I get a proper error now:

RuntimeError: Tensor for 'out' is on CPU, Tensor for argument #1 'self' is on CPU, but expected them to be on GPU (while checking arguments for addmm)

so it seems to be an already fixed issue.

klaus · July 21, 2021, 9:05pm

Hey, I am facing this CUDA error: an illegal memory access was encountered. I am running my program on 4 GPUs using PyTorch’s DistributedDataParallel:

Traceback (most recent call last):
  File "analysis_pipelines/training_models/tabnet.py", line 150, in <module>
    distributed_tabnet(train_data_path, test_data_path, save_model_path, local_rank)
  File "analysis_pipelines/training_models/tabnet.py", line 85, in distributed_tabnet
    tabnet_model.multiprocess_fit(X_train=X_train, y_train=y_train, eval_set=eval_set, eval_metric=eval_metric, weights=1, batch_size= batch_size, patience= patience, train_dataloader=train_loader, max_epochs=max_epochs)
  File "/home/rkudupudi/rkudupudi_tf/lib/python3.6/site-packages/pytorch_tabnet/abstract_model.py", line 371, in multiprocess_fit
    self._train_epoch(train_dataloader)
  File "/home/rkudupudi/rkudupudi_tf/lib/python3.6/site-packages/pytorch_tabnet/abstract_model.py", line 633, in _train_epoch
    batch_logs = self._train_batch(X, y)
  File "/home/rkudupudi/rkudupudi_tf/lib/python3.6/site-packages/pytorch_tabnet/abstract_model.py", line 675, in _train_batch
    loss.backward()
  File "/home/rkudupudi/rkudupudi_tf/lib/python3.6/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/rkudupudi/rkudupudi_tf/lib/python3.6/site-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.

The program runs on a single GPU without any problems. I tried setting CUDA_LAUNCH_BLOCKING=1 when starting the program from the terminal, but the program hung after a few epochs of training. After that I was blocked from accessing the GPUs until I restarted my machine.
My current PyTorch version is 1.9.0+cu102.
Kindly let me know how to resolve this.
Thanks

ptrblck · July 21, 2021, 9:24pm

You could run the script via compute-sanitizer or cuda-gdb to get the stacktrace or alternatively you could also create a cuda coredump via CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 and check it via cuda-gdb afterwards.
If this doesn’t work, could you post an executable code snippet to reproduce the issue?
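As a rough sketch of the two options, the snippet below only builds the command line and environment for each and does not start anything. `debug_launch_plan` is a hypothetical helper and `your_script.py` is a placeholder. It assumes `compute-sanitizer` is on the PATH and is invoked in its basic form, with the application and its arguments following the tool name.

```python
import os
import shutil


def debug_launch_plan(script, use_sanitizer=True):
    """Return (argv, env) for running `script` under one of the two suggested tools."""
    env = dict(os.environ)
    argv = ["python", script]
    if use_sanitizer:
        # compute-sanitizer wraps the application and reports invalid memory accesses.
        argv = ["compute-sanitizer", *argv]
    else:
        # Ask the driver to write a GPU core dump, to be inspected with cuda-gdb afterwards.
        env["CUDA_ENABLE_COREDUMP_ON_EXCEPTION"] = "1"
    return argv, env


argv, env = debug_launch_plan("your_script.py")
if shutil.which(argv[0]) is None:
    print(f"{argv[0]} was not found on PATH")
else:
    print("would run:", " ".join(argv))
```

The returned pair can be passed to `subprocess.run(argv, env=env)` once the tool is available. Expect the sanitizer run to be much slower than a normal run.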