Weird CUDA illegal memory access error

Hi all, I encountered a weird CUDA illegal memory access error. I will try to put together a minimal example in a while.

During training, my code runs for several batches without any errors; then, after a random amount of time, there is an illegal memory access error. The error happens on this line:

conf_p = conf[pos]

and the error message is:

  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 74, in __getitem__
    return MaskedSelect.apply(self, key)
  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/autograd/_functions/tensor.py", line 534, in forward
    return tensor.masked_select(mask)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /export/home/x/code/pytorch/torch/lib/THC/generated/../THCReduceAll.cuh:339

Interestingly, even if I replace this line of code with:

print(pos)

there is still an error:

  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 119, in __repr__
    return 'Variable containing:' + self.data.__repr__()
  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 133, in __repr__
    return str(self)
  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 140, in __str__
    return _tensor_str._str(self)
  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 294, in _str
    strt = _tensor_str(self)
  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 142, in _tensor_str
    formatter = _number_format(self, min_sz=3 if not print_full_mat else 0)
  File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 74, in _number_format
    tensor = torch.DoubleTensor(tensor.size()).copy_(tensor).abs_().view(tensor.nelement())
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /export/home/x/code/pytorch/torch/lib/THC/generic/THCTensorCopy.c:70

However, I can run:

print(pos.size(), pos.is_contiguous())

I ran it in CPU mode; below is the gdb backtrace:

Thread 21 "python" received signal SIGBUS, Bus error.                                                                    [54/1858]
[Switching to Thread 0x7fff4ce01700 (LWP 32576)]
malloc_consolidate (av=av@entry=0x7ffe58000020) at malloc.c:4181
4181    malloc.c: No such file or directory.
(gdb) bt
#0  malloc_consolidate (av=av@entry=0x7ffe58000020) at malloc.c:4181
#1  0x00007ffff6a51678 in _int_free (av=0x7ffe58000020, p=<optimized out>, have_lock=0) at malloc.c:4075
#2  0x00007ffff6a5553c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968
#3  0x00007fffe0b853fe in THLongStorage_free ()
   from /export/home/x/anaconda3/lib/python3.6/site-packages/torch/lib/libTH.so.1
#4  0x00007fffe0baf4e7 in THLongTensor_free ()
   from /export/home/x/anaconda3/lib/python3.6/site-packages/torch/lib/libTH.so.1
#5  0x00007fffe01b8839 in at::CPULongTensor::~CPULongTensor() [clone .localalias.31] ()
   from /export/home/x/anaconda3/lib/python3.6/site-packages/torch/lib/libATen.so.1
#6  0x00007fffecf88269 in torch::autograd::VariableImpl::~VariableImpl (this=0x7ffdd793fc40, __in_chrg=<optimized out>)
    at torch/csrc/autograd/variable.cpp:38
#7  0x00007fffecf9ad21 in at::TensorImpl::release (this=<optimized out>)
    at /export/home/x/code/pytorch/torch/lib/tmp_install/include/ATen/TensorImpl.h:31
#8  at::detail::TensorBase::~TensorBase (this=<optimized out>, __in_chrg=<optimized out>)
    at /export/home/x/code/pytorch/torch/lib/tmp_install/include/ATen/TensorBase.h:27
#9  at::Tensor::~Tensor (this=<optimized out>, __in_chrg=<optimized out>)
    at /export/home/x/code/pytorch/torch/lib/tmp_install/include/ATen/Tensor.h:33
#10 at::Tensor::reset (this=0x7fffa888c288) at /export/home/x/code/pytorch/torch/lib/tmp_install/include/ATen/Tensor.h:57
#11 THPVariable_clear (self=self@entry=0x7fffa888c278) at torch/csrc/autograd/python_variable.cpp:131
#12 0x00007fffecf9ae31 in THPVariable_dealloc (self=0x7fffa888c278) at torch/csrc/autograd/python_variable.cpp:138
#13 0x00007ffff79a75f9 in subtype_dealloc (self=0x7fffa888c278) at Objects/typeobject.c:1222
#14 0x00007ffff79a5f3e in tupledealloc (op=0x7fffa8851a08) at Objects/tupleobject.c:243
#15 0x00007fffecf9072f in THPPointer<_object>::~THPPointer (this=0x7fff4ce00850, __in_chrg=<optimized out>)
    at /export/home/x/code/pytorch/torch/csrc/utils/object_ptr.h:12
#16 torch::autograd::PyFunction::apply (this=0x7fffa877e6d0, inputs=...) at torch/csrc/autograd/python_function.cpp:123
#17 0x00007fffecf7bbf4 in torch::autograd::Function::operator() (inputs=..., this=<optimized out>)
    at /export/home/x/code/pytorch/torch/csrc/autograd/function.h:89
#18 torch::autograd::call_function (task=...) at torch/csrc/autograd/engine.cpp:208
#19 torch::autograd::Engine::evaluate_function (this=this@entry=0x7fffee198ca0 <engine>, task=...)
    at torch/csrc/autograd/engine.cpp:220
#20 0x00007fffecf7ddae in torch::autograd::Engine::thread_main (this=0x7fffee198ca0 <engine>, graph_task=0x0)
    at torch/csrc/autograd/engine.cpp:144
#21 0x00007fffecf7ab42 in torch::autograd::Engine::thread_init (this=this@entry=0x7fffee198ca0 <engine>, device=device@entry=-1)
    at torch/csrc/autograd/engine.cpp:121
#22 0x00007fffecf9da9a in torch::autograd::python::PythonEngine::thread_init (this=0x7fffee198ca0 <engine>, device=-1)
    at torch/csrc/autograd/python_engine.cpp:28
#23 0x00007fffcd559c5c in std::execute_native_thread_routine_compat (__p=<optimized out>)
    at /home/msarahan/miniconda2/conda-bld/compilers_linux-64_1507259624353/work/.build/src/gcc-7.2.0/libstdc++-v3/src/c++11/thread.cc:110
#24 0x00007ffff76ba6ba in start_thread (arg=0x7fff4ce01700) at pthread_create.c:333
#25 0x00007ffff6ad83dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

It seems that a piece of memory is being wrongly freed.

Hi,

You should run your code with CUDA_LAUNCH_BLOCKING=1 to see where the error comes from.
Because all CUDA calls are asynchronous when you don't specify this option, the Python code reports the error on the next CUDA call after the actual failure. This is why trying to use the tensor or print its contents raises an error (that uses the GPU), while printing the size or checking whether it is contiguous does not (those are CPU-only operations).
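
If you cannot use the environment variable, a rough alternative (my own sketch, not code from this thread) is to force a synchronization right after the suspect line, so that any pending kernel failure is reported there instead of at a later, unrelated call:

import torch

# Stand-ins for the poster's tensors; shapes and names are made up.
conf = torch.randn(8, 4, device="cuda")
pos = conf > 0                     # boolean mask on the GPU

conf_p = conf[pos]                 # the masked select under suspicion
torch.cuda.synchronize()           # block until all queued kernels finish;
                                   # an asynchronous CUDA error surfaces here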


Thanks for your reply. I actually tried that method, and the error message is the same as the first one.

Weird.
Could you provide us with a minimal example that reproduces the problem, please?

Yeah I will try that

I just found that even if I set CUDA_LAUNCH_BLOCKING=1, there is still an error when I try to print the tensor. I was running:

CUDA_LAUNCH_BLOCKING=1
python train.py

Is this the right way to set this environment variable?

No.
If you run it as two commands, you should use export CUDA_LAUNCH_BLOCKING=1, but that will set it for the whole terminal session.
If you use CUDA_LAUNCH_BLOCKING=1 python train.py (as one command), that will set the env variable just for this command.
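
A further option (my own sketch, not from this thread) is to set the variable from inside the script itself; this should work as long as it happens before the first CUDA call initializes the runtime:

import os

# Must run before CUDA is initialized, i.e. before the first .cuda()/.to("cuda") call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # importing torch alone does not yet create a CUDA context

x = torch.randn(4, device="cuda")  # kernel launches from here on are synchronous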

Yeah, I was wondering whether I needed to put them on one line. Thanks for your reply!

I put them on the same line now; here is the error message:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  function_attributes(): after cudaFuncGetAttributes: an illegal memory access was encountered
./train.sh: line 14:  4111 Aborted                 (core dumped) CUDA_LAUNCH_BLOCKING=1 python train.py

I finally solved this problem.

Although the error message is not very helpful, I guessed that the illegal memory access came from an out-of-range index. So I double-checked all my code and finally found that, in certain batches, the ground-truth target could be larger than the number of classes in the softmax. I fixed it and there are no more errors :)
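
For anyone hitting the same thing, here is a minimal sketch of the kind of sanity check that would have caught this (the names and sizes are made up, not from my training code):

import torch
import torch.nn.functional as F

num_classes = 21                                          # assumed softmax size
logits = torch.randn(32, num_classes, device="cuda")
target = torch.randint(0, num_classes, (32,), device="cuda")

# An out-of-range label here is exactly what later shows up as an
# illegal memory access inside the CUDA kernel.
assert 0 <= target.min().item() and target.max().item() < num_classes, \
    f"labels must lie in [0, {num_classes})"

loss = F.cross_entropy(logits, target)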

@albanD thanks for your time anyway


Good that you found the problem!

I have a similar problem.
[screenshot of the offending line]
Here weight_mask is a tensor.
[screenshot of the error]
After the script ran for days and this line had been called a couple hundred times, this error occurred. I am not sure whether I will be able to reproduce this error again or how long it will take to appear again. Any thoughts on this or possible explanations? Thank you.

Just to share my case: I had a similar error code.
I commented out the line cudnn.benchmark=True and everything works fine now.

The training code works fine with that line commented out, but when I run my validation code, it crashes with the same error 77 (illegal memory access).
Anyway, I will share more if I find something else.
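
For reference, the flag in question is the global cuDNN autotuner switch; a minimal sketch of disabling it looks like this:

import torch

# Turn off the cuDNN autotuner; with benchmark=True, cuDNN picks per-shape
# algorithms, which is the setting some posters here found correlated with the crash.
torch.backends.cudnn.benchmark = False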


Thanks for your solution.

I’m getting the same illegal memory access error, which was caused by moving the tensors to the GPU: input[key] = input[key].cuda().

I tried setting cudnn.benchmark = False and rm -rf ~/.nv following some web searches, but without success. Any suggestions? Thanks a lot!

EDIT: I realized that cudnn.benchmark was set back to True on a later line (I was running someone else’s git repo), and after resetting it to False the error went away!


Have you solved this problem and found the reason for it?

I also ran into the same error when evaluating the model:

RuntimeError: CUDA error: an illegal memory access was encountered

My code looks like

correct = 0
total = 0
for i, (input, target) in tqdm.tqdm(enumerate(data_loader), total=len(dataset)//batch_size):
    target = target.to(device)
    input = input.to(device)
    output = self.model.forward_t(input)
    c = output.argmax(dim=1)
    total += len(target)
    correct += sum(target.cpu().numpy() == c.cpu().numpy())
    acc = float(correct) / total

It is also strange that, if I do not use .cpu().numpy() to convert the data first, the result is incorrect.
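
For comparison, here is a sketch of the same accuracy computation done entirely with tensor operations, so the comparison and the sum stay on one device (model, data_loader, and device are placeholders for your own objects):

import torch

correct = 0
total = 0
with torch.no_grad():                              # evaluation only, no autograd graph
    for input, target in data_loader:              # placeholder DataLoader
        input = input.to(device)
        target = target.to(device)
        pred = model(input).argmax(dim=1)
        correct += (pred == target).sum().item()   # .item() brings back a Python int
        total += target.size(0)
acc = correct / total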


Hi,
I am facing the same issue. Setting cudnn.benchmark=False did not help (it was set to False from the beginning). My code crashes after a second call to some function (I used CUDA_LAUNCH_BLOCKING=1 to find out where the error occurred). Any pointers to the cause and how to fix it? Thanks.

File "../libs/bn.py", line 109, in forward
    self.training, self.momentum, self.eps, self.activation, self.slope)
  File "../libs/functions.py", line 99, in forward
    running_mean.mul_((1 - ctx.momentum)).add_(ctx.momentum * mean)
RuntimeError: CUDA error: an illegal memory access was encountered

When trying to print the value of the tensor running_mean (during the second call), it raises the following error:


print(running_mean)
  File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/tensor.py", line 66, in __repr__
    return torch._tensor_str._str(self)
  File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/_tensor_str.py", line 277, in _str
    tensor_str = _tensor_str(self, indent)
  File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/_tensor_str.py", line 195, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/_tensor_str.py", line 84, in __init__
    nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
  File "..../Venvs/pytorch.1.0.1/lib/python3.7/site-packages/torch/functional.py", line 271, in isfinite
    return (tensor == tensor) & (tensor.abs() != inf)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/generated/../THCTensorMathCompareT.cuh:69

→ running_mean seems to have inf values!
It seems to be an issue related to the machine where the code is running (more specifically, CUDA-related; things run fine on CPU).
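
If it helps anyone debugging a similar case, here is a hedged sketch of checking a batch-norm module's running buffers for non-finite values at the end of an iteration, before anything crashes (bn stands for the custom batch-norm module from libs/bn.py):

import torch

def check_running_stats(bn, name="bn"):
    # Inspect the running buffers on the CPU so the check itself does not
    # depend on the (possibly already broken) CUDA context.
    for buf_name in ("running_mean", "running_var"):
        buf = getattr(bn, buf_name, None)
        if buf is None:
            continue
        bad = (~torch.isfinite(buf.detach().cpu())).sum().item()
        if bad:
            print(f"{name}.{buf_name}: {bad} non-finite values")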

Fix and possible explanation.

How can I do that? I have the same problem.

You just need to set the environment variable before launching your script. The simplest is CUDA_LAUNCH_BLOCKING=1 python your_script.py.

Another solution is to use torch.cuda.set_device(1); that should also work.