Device-side assert triggered & an illegal memory access was encountered

Hi, all,

I keep running the same code, and I get a different error on different runs.

Sometimes I get:
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/generic/THCTensorCopy.cpp:20

Other times I get:
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/generic/THCTensorCopy.cpp:21

Can someone help me with this problem?

Hi,

Could you run the code with CUDA_LAUNCH_BLOCKING=1 and post the stack trace you get here?
This is usually due to invalid indexing of a CUDA tensor. It can happen, for example, if some of your ground-truth labels are larger than the number of classes.
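If it is more convenient, the variable can also be set from inside the script; here is a minimal sketch (it only takes effect if it runs before the first CUDA call, so put it at the very top of test.py):

```python
import os

# Equivalent to launching with `CUDA_LAUNCH_BLOCKING=1 python test.py`:
# kernel launches become synchronous, so the Python stack trace points at the
# operation that actually failed instead of at some later, unrelated call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the variable is set
```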

Thanks, AlbanD,

After running with CUDA_LAUNCH_BLOCKING=1, I got:

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=663 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
File “test.py”, line 103, in <module>
train(epoch)
File “test.py”, line 86, in train
loss.backward()
File “/usr/local/lib/python2.7/dist-packages/torch/tensor.py”, line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File “/usr/local/lib/python2.7/dist-packages/torch/autograd/__init__.py”, line 90, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCGeneral.cpp:663

BTW, the loss is torch.nn.functional.nll_loss(), and the problem only shows up after several training steps (i.e. after several successful backward() calls).

You should double-check that the target you give to your loss function has shape (N) and that each value satisfies 0 ≤ targets[i] ≤ C−1, as stated in the docs.

If these are valid, you can enable the anomaly detection mode to get more information about the error in the backward pass.
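Something along these lines, just as a sketch (the helper name and the random shapes are made up; replace them with what your training loop actually produces):

```python
import torch
import torch.nn.functional as F

def check_nll_target(log_probs, target):
    # `log_probs` is the (N, C) tensor fed to nll_loss, `target` the (N,) labels.
    # This helper is not part of any library; it is just the range check above.
    num_classes = log_probs.size(1)
    assert target.dim() == 1 and target.size(0) == log_probs.size(0)
    assert int(target.min()) >= 0 and int(target.max()) < num_classes, \
        "target contains class indices outside [0, C-1]"

# quick self-test with random data
log_probs = F.log_softmax(torch.randn(8, 5), dim=1)
target = torch.randint(0, 5, (8,))
check_nll_target(log_probs, target)
loss = F.nll_loss(log_probs, target)

# Wrapping the real training step in anomaly detection is what produces the
# "Traceback of forward call that caused the error" warnings when backward fails:
#     with torch.autograd.detect_anomaly():
#         end_point = model(data)
#         loss = F.nll_loss(end_point, target)
#         loss.backward()
```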

Thanks, AlbanD,

I can confirm that the target passed to the loss is correct.

With anomaly detection mode enabled, I still get different errors on different runs, as shown below. Note that I do not shuffle the data fed to the model, so the input is identical every time.

(1):
THCudaCheck FAIL file=/pytorch/aten/src/THC/generated/…/THCTensorMathCompareT.cuh line=71 error=77 : an illegal memory access was encountered
sys:1: RuntimeWarning: Traceback of forward call that caused the error:
File “test.py”, line 105, in <module>
train(epoch)
File “test.py”, line 85, in train
end_point = model(data)
File “/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py”, line 477, in __call__
result = self.forward(*input, **kwargs)
File “test.py”, line 50, in forward
data = max_pool(cluster, data, transform=transform)
File “/usr/local/lib/python2.7/dist-packages/torch_geometric/nn/pool/max_pool.py”, line 29, in max_pool
x = _max_pool_x(cluster, data.x)
File “/usr/local/lib/python2.7/dist-packages/torch_geometric/nn/pool/max_pool.py”, line 10, in _max_pool_x
x, _ = scatter_max(x, cluster, dim=0, dim_size=size, fill_value=fill)
File “/usr/local/lib/python2.7/dist-packages/torch_scatter/max.py”, line 99, in scatter_max
return ScatterMax.apply(out, src, index, dim)

Traceback (most recent call last):
File “test.py”, line 105, in <module>
train(epoch)
File “test.py”, line 88, in train
loss.backward()
File “/usr/local/lib/python2.7/dist-packages/torch/tensor.py”, line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File “/usr/local/lib/python2.7/dist-packages/torch/autograd/__init__.py”, line 90, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/generated/…/THCTensorMathCompareT.cuh:71

(2):
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCTensorCopy.cu line=102 error=59 : device-side assert triggered
sys:1: RuntimeWarning: Traceback of forward call that caused the error:
File “test.py”, line 105, in <module>
train(epoch)
File “test.py”, line 85, in train
end_point = model(data)
File “/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py”, line 477, in __call__
result = self.forward(*input, **kwargs)
File “test.py”, line 48, in forward
data.x = F.elu(self.conv2(data.x, data.edge_index, data.edge_attr))
File “/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py”, line 477, in __call__
result = self.forward(*input, **kwargs)
File “/usr/local/lib/python2.7/dist-packages/torch_geometric/nn/conv/spline_conv.py”, line 88, in forward
self.norm, self.root, self.bias)
File “/usr/local/lib/python2.7/dist-packages/torch_spline_conv/conv.py”, line 62, in apply
out = SplineWeighting.apply(x[col], weight, *data)

Traceback (most recent call last):
File “test.py”, line 105, in <module>
train(epoch)
File “test.py”, line 88, in train
loss.backward()
File “/usr/local/lib/python2.7/dist-packages/torch/tensor.py”, line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File “/usr/local/lib/python2.7/dist-packages/torch/autograd/__init__.py”, line 90, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/THCTensorCopy.cu:102

Hi,

Do you have a custom max-pooling implementation in torch_geometric? The indexing error comes from the fact that, during the backward of the pooling, the indices given to the ScatterMax function are not valid.
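If it helps, this is the kind of check I mean, in plain PyTorch (the helper name and where you call it are only a suggestion):

```python
import torch

def check_scatter_index(index, dim_size, name="index"):
    # Made-up helper: verify scatter indices on the CPU side before they reach a
    # CUDA kernel, where an out-of-range index only surfaces later as a
    # device-side assert (59) or an illegal memory access (77).
    if index.numel() == 0:
        return
    lo, hi = int(index.min()), int(index.max())
    if lo < 0 or hi >= dim_size:
        raise ValueError("{} out of range: min={}, max={}, valid range is [0, {}]"
                         .format(name, lo, hi, dim_size - 1))

# e.g. in forward(), right before max_pool(cluster, data, transform=transform),
# with dim_size being whatever ends up as the dim_size argument of scatter_max:
#     check_scatter_index(cluster, dim_size, "cluster")
```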

The other one is a device-side assert that fails when calling the SplineWeighting function. I'm not sure what it does, since it comes from your torch_spline_conv package, but something must be wrong with it.
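For that one, the indexing visible in the trace is x[col], where col comes from data.edge_index, so a similar sanity check would be (again, the helper is made up and assumes data is the torch_geometric Data object you pass to the model):

```python
import torch

def check_edge_index(x, edge_index):
    # Made-up helper: every entry of edge_index must be a valid row index of x,
    # otherwise the x[col] gather inside the spline convolution reads out of bounds.
    num_nodes = x.size(0)
    row, col = edge_index
    assert int(edge_index.min()) >= 0, "negative node index in edge_index"
    assert int(row.max()) < num_nodes and int(col.max()) < num_nodes, \
        "edge_index refers to node ids that do not exist in x"

# e.g. in forward(), before the conv2 call:
#     check_edge_index(data.x, data.edge_index)
```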

Thanks, AlbanD,
your advice is very helpful, much appreciated!

Hi @Ian,
Did you find the cause of the issue, and how did you fix it? Thanks.
I am facing a similar issue; however, I only get this error:

RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/generated/../THCTensorMathCompareT.cuh:69

when trying to access a tensor (for example, by printing it).

The issue seems to be related to the machine it is running on.