An illegal memory access was encountered

Hi, all,

I ran into the following problem when running my code:

THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorCopy.cpp line=20 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "test.py", line 102, in <module>
    train(epoch)
  File "test.py", line 79, in train
    data = data.to(device)
  File "/usr/local/lib/python2.7/dist-packages/torch_geometric/data/data.py", line 104, in to
    return self.apply(lambda x: x.to(device), *keys)
  File "/usr/local/lib/python2.7/dist-packages/torch_geometric/data/data.py", line 97, in apply
    self[key] = func(item)
  File "/usr/local/lib/python2.7/dist-packages/torch_geometric/data/data.py", line 104, in <lambda>
    return self.apply(lambda x: x.to(device), *keys)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/generic/THCTensorCopy.cpp:20

When I run CUDA_LAUNCH_BLOCKING=1 python test.py to pinpoint the error, I get:

/pytorch/aten/src/THC/THCTensorIndex.cu:395: void WrapIndexOp::operator()(long *, long *): block: [0,0,0], thread: [40,0,0] Assertion `idx < size && idx >= -size` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:395: void WrapIndexOp::operator()(long *, long *): block: [0,0,0], thread: [58,0,0] Assertion `idx < size && idx >= -size` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:395: void WrapIndexOp::operator()(long *, long *): block: [0,0,0], thread: [2,0,0] Assertion `idx < size && idx >= -size` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:395: void WrapIndexOp::operator()(long *, long *): block: [0,0,0], thread: [8,0,0] Assertion `idx < size && idx >= -size` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:395: void WrapIndexOp::operator()(long *, long *): block: [0,0,0], thread: [26,0,0] Assertion `idx < size && idx >= -size` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:395: void WrapIndexOp::operator()(long *, long *): block: [0,0,0], thread: [31,0,0] Assertion `idx < size && idx >= -size` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCTensorCopy.cu line=102 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "test.py", line 102, in <module>
    train(epoch)
  File "test.py", line 85, in train
    F.nll_loss(model(data), data.y).backward()
  File "/usr/local/lib/python2.7/dist-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/THCTensorCopy.cu:102

My environment:
PyTorch: 0.4
CUDA: 8.0
GPU: Tesla K80

Strangely, the error does not occur at the same point every time: sometimes it happens after a few backward steps, sometimes after several epochs. When I train the model on another dataset, everything is fine.
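
If I read the assertion correctly, some index tensor ends up with values outside [-size, size), i.e. out-of-range indices, which would also explain why only this dataset triggers it. As a sanity check I'm planning to validate every graph on the CPU before calling .to(device), roughly like the sketch below (dataset stands for my iterable of torch_geometric Data objects; edge_index and num_nodes are the standard torch_geometric fields):

```python
# Rough CPU-side sanity check, run before any .to(device): flag graphs
# whose edge_index has the wrong dtype or points outside [0, num_nodes).
# "dataset" is a placeholder for my iterable of torch_geometric Data objects.
import torch

def check_edge_indices(dataset):
    for i, data in enumerate(dataset):
        ei = data.edge_index
        if ei is None or ei.numel() == 0:
            continue
        if ei.dtype != torch.long:
            print("graph {}: edge_index dtype is {}".format(i, ei.dtype))
        if ei.min().item() < 0 or ei.max().item() >= data.num_nodes:
            print("graph {}: edge indices span [{}, {}] but num_nodes = {}".format(
                i, ei.min().item(), ei.max().item(), data.num_nodes))
```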

Can anybody help me with this problem? Is it a problem with the dataset's dtype?
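
For reference, the dtype/label check I have in mind on the loss side looks roughly like this, since as far as I know F.nll_loss also triggers a device-side assert when a target is negative or >= the number of classes (num_classes below is a placeholder for my model's output size):

```python
# Rough check of the targets fed to F.nll_loss: they should be torch.long
# and lie in [0, num_classes). "num_classes" is a placeholder for my
# model's output dimension.
import torch

def check_labels(dataset, num_classes):
    for i, data in enumerate(dataset):
        y = data.y
        if y.dtype != torch.long:
            print("graph {}: y dtype is {}".format(i, y.dtype))
        if y.min().item() < 0 or y.max().item() >= num_classes:
            print("graph {}: labels span [{}, {}] but num_classes = {}".format(
                i, y.min().item(), y.max().item(), num_classes))
```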