Runtime error (77)

I am trying to sum a Variable with torch.sum() on the GPU, and I get the following error:

    THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1501972792122/work/pytorch-0.1.12/torch/lib/THC/generated/../THCReduceAll.cuh line=334 error=77 : an illegal memory access was encountered

I searched and found some suggested fixes, such as https://github.com/torch/cutorch/issues/489, but none of them worked. Any suggestions?

Interesting! Do you have a repro script for us to debug?

Thanks. Here is the code:

    loss = (target_value - out).pow(2).sum()

where target_value and out are both Variables on the GPU.
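
For completeness, a minimal self-contained version of that snippet might look like the following. The shapes, the backward call, and the print are assumptions; only the loss line comes from my actual code:

    import torch
    from torch.autograd import Variable

    # Hypothetical shapes; only the loss line below is from the real code.
    out = Variable(torch.randn(32, 10).cuda(), requires_grad=True)
    target_value = Variable(torch.randn(32, 10).cuda())

    loss = (target_value - out).pow(2).sum()  # this is where the error fires
    loss.backward()
    print(loss.data[0])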

After upgrading to 0.2.0, the error still remains…

    THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THC/generated/../generic/THCTensorMathPointwise.cu line=313 error=59 : device-side assert triggered

    Segmentation fault (core dumped)

This line of code has no issue on its own. Could you post a reproducing script, please? Thanks!
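
In the meantime, one thing that can help: device-side asserts and illegal memory accesses are reported asynchronously, so the file/line in the message often points at a later, unrelated kernel (here, the reduction). A sketch of how to get a more accurate location, assuming you launch the script yourself:

    import os
    # Must be set before CUDA is initialized, i.e. before the first CUDA call.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch
    # ... run the failing code; the error should now be raised at the op that
    # actually triggered it.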

What do you mean by a reproducing script?

I meant a self-contained code snippet that can be used to reproduce the error you are seeing.

It seems that moving the code to another place and running it again solves the problem…

The same thing happened in my case:

  • Running the code on machine-a gives the error (CUDA device).
  • Running the same code on machine-b passes with no errors (CUDA device).
  • The code runs correctly on both machines when the CPU is used as the device.

Any idea why? Thanks!
Both machines have:

  • CUDA:

    $ nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2018 NVIDIA Corporation
    Built on Sat_Aug_25_21:08:01_CDT_2018
    Cuda compilation tools, release 10.0, V10.0.130

  • PyTorch: 1.0.0
  • Python:

    $ python
    Python 3.7.0 | packaged by conda-forge | (default, Nov 12 2018, 20:15:55)
    [GCC 7.3.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.

Difference in Ubuntu version:

  • machine-a:

    $ cat /etc/*release
    DISTRIB_ID=Ubuntu
    DISTRIB_RELEASE=16.04
    DISTRIB_CODENAME=xenial
    DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
    NAME="Ubuntu"
    VERSION="16.04.4 LTS (Xenial Xerus)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 16.04.4 LTS"
    VERSION_ID="16.04"
    HOME_URL="http://www.ubuntu.com/"
    SUPPORT_URL="http://help.ubuntu.com/"
    BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
    VERSION_CODENAME=xenial
    UBUNTU_CODENAME=xenial

  • machine-b:

    $ cat /etc/*release
    DISTRIB_ID=Ubuntu
    DISTRIB_RELEASE=18.04
    DISTRIB_CODENAME=bionic
    DISTRIB_DESCRIPTION="Ubuntu 18.04.2 LTS"
    NAME="Ubuntu"
    VERSION="18.04.2 LTS (Bionic Beaver)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 18.04.2 LTS"
    VERSION_ID="18.04"
    HOME_URL="https://www.ubuntu.com/"
    SUPPORT_URL="https://help.ubuntu.com/"
    BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    VERSION_CODENAME=bionic
    UBUNTU_CODENAME=bionic

Now things are getting really weird. On machine-a, where the code raises the error, whether it fails depends on the CUDA device ID:

    DEVICE = torch.device("cuda:{}".format(cuda) if torch.cuda.is_available() else "cpu")

  • When cuda = 0: no error.
  • When cuda > 0: error.

Any suggestions?
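
One quick check that may help narrow this down (a sketch; DEVICE is the name from the snippet above, and model stands for your nn.Module):

    # Confirm that every registered parameter and buffer actually moved to the
    # selected device. Note: DEVICE must carry an explicit index ("cuda:1"),
    # as above, for the comparison to be meaningful.
    model = model.to(DEVICE)
    for name, t in list(model.named_parameters()) + list(model.named_buffers()):
        if t.device != DEVICE:
            print("still on {}: {}".format(t.device, name))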

Fix:
The issue was caused by my model, which has a part that is created directly on CUDA. Calling model.to(DEVICE) does not seem to move that part to the selected GPU; it stays on GPU 0. Things only work when DEVICE points to GPU 0, because then both parts of the model end up on the same device. Otherwise, the two parts sit on different GPUs, and this seems to raise the error above.
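
For illustration, a hypothetical module with the kind of pattern that can cause this. The key point is the bare .cuda() call, which targets the current device (GPU 0 by default) no matter what was passed to .to(DEVICE):

    import torch
    import torch.nn as nn

    class Model(nn.Module):
        def __init__(self):
            super(Model, self).__init__()
            self.fc = nn.Linear(16, 16)

        def forward(self, x):
            # Bare .cuda() lands on the *current* device (GPU 0 unless changed),
            # not on the device the rest of the model was moved to.
            mask = torch.ones(x.size()).cuda()
            return self.fc(x * mask)

    DEVICE = torch.device("cuda:1")
    model = Model().to(DEVICE)              # parameters go to cuda:1 ...
    x = torch.randn(2, 16, device=DEVICE)
    model(x)                                # ... but mask is on cuda:0

Depending on the op, this cross-device mix can surface as the illegal-memory-access or device-side-assert errors above rather than a clean Python exception.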

Now, after constructing DEVICE, I explicitly set the current PyTorch device with torch.cuda.set_device(id). This seems to put both parts of the model on the selected device (DEVICE), and the error is no longer raised.
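
Roughly, the working setup looks like this (a sketch; cuda is the integer GPU ID as above, and Model is the hypothetical module from the previous snippet):

    import torch

    cuda = 1  # selected GPU ID
    DEVICE = torch.device("cuda:{}".format(cuda) if torch.cuda.is_available() else "cpu")

    if DEVICE.type == "cuda":
        # Make the selected GPU the current device, so bare .cuda() calls
        # inside the model also land on it.
        torch.cuda.set_device(cuda)

    model = Model().to(DEVICE)

An alternative that does not depend on the current device is to give such tensors an explicit device when they are created, e.g. torch.ones(x.size(), device=x.device) inside forward.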