Hi all, I've encountered a weird CUDA illegal memory access error. I'll try to put together a minimal example shortly.
During training, my code runs for several batches without any errors; then, after a random amount of time, an illegal memory access error is raised. The error happens on this line:
conf_p = conf[pos]
and the error message is:
File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 74, in __getitem__
return MaskedSelect.apply(self, key)
File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/autograd/_functions/tensor.py", line 534, in forward
return tensor.masked_select(mask)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /export/home/x/code/pytorch/torch/lib/THC/generated/../THCReduceAll.cuh:339
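For reference, the failing pattern boils down to something like the snippet below. The shapes and dtypes are just my guesses for a minimal repro, not the real training code, and the synchronize() call is only there to force any asynchronous CUDA error to surface right at this op:

import torch

conf = torch.randn(1000, 21, device='cuda')       # e.g. per-anchor class scores
pos = torch.rand(1000, 21, device='cuda') > 0.5   # boolean mask with the same shape as conf
conf_p = conf[pos]                                 # dispatches to masked_select, as in the traceback
torch.cuda.synchronize()                           # make any pending CUDA error show up here
print(conf_p.size())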
Interestingly, even if I replace this line of code with:
print(pos)
there is still an error:
File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 119, in __repr__
return 'Variable containing:' + self.data.__repr__()
File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 133, in __repr__
return str(self)
File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 140, in __str__
return _tensor_str._str(self)
File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 294, in _str
strt = _tensor_str(self)
File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 142, in _tensor_str
formatter = _number_format(self, min_sz=3 if not print_full_mat else 0)
File "/export/home/x/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 74, in _number_format
tensor = torch.DoubleTensor(tensor.size()).copy_(tensor).abs_().view(tensor.nelement())
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /export/home/x/code/pytorch/torch/lib/THC/generic/THCTensorCopy.c:70
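I suspect this is because CUDA kernel launches are asynchronous: the op named in the traceback is just the first point that synchronizes with the device, not necessarily the kernel that actually went wrong. To get a traceback that points at the real launch, I'll rerun with blocking launches; a sketch (the variable has to be set before the CUDA context is created, i.e. before the first CUDA call):

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'   # set before anything touches the GPU

import torch
# ... rest of the training script; with blocking launches the RuntimeError
# should be raised at the kernel that actually performs the illegal access.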
However, I can run:
print(pos.size(), pos.is_contiguous())
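which makes sense to me: size() and is_contiguous() only read host-side metadata, while printing the values (as the second traceback shows at _tensor_str.py line 74) has to copy the tensor off the GPU, and that copy is what trips over the already-broken CUDA context. A small illustration of the difference, assuming a recent PyTorch build:

print(pos.size(), pos.is_contiguous())   # metadata only, no device access -> works
vals = pos.cpu()                         # device-to-host copy -> re-raises the CUDA error
print(vals)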
I also ran it in CPU mode; below is the gdb backtrace:
Thread 21 "python" received signal SIGBUS, Bus error. [54/1858]
[Switching to Thread 0x7fff4ce01700 (LWP 32576)]
malloc_consolidate (av=av@entry=0x7ffe58000020) at malloc.c:4181
4181 malloc.c: No such file or directory.
(gdb) bt
#0 malloc_consolidate (av=av@entry=0x7ffe58000020) at malloc.c:4181
#1 0x00007ffff6a51678 in _int_free (av=0x7ffe58000020, p=<optimized out>, have_lock=0) at malloc.c:4075
#2 0x00007ffff6a5553c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968
#3 0x00007fffe0b853fe in THLongStorage_free ()
from /export/home/x/anaconda3/lib/python3.6/site-packages/torch/lib/libTH.so.1
#4 0x00007fffe0baf4e7 in THLongTensor_free ()
from /export/home/x/anaconda3/lib/python3.6/site-packages/torch/lib/libTH.so.1
#5 0x00007fffe01b8839 in at::CPULongTensor::~CPULongTensor() [clone .localalias.31] ()
from /export/home/x/anaconda3/lib/python3.6/site-packages/torch/lib/libATen.so.1
#6 0x00007fffecf88269 in torch::autograd::VariableImpl::~VariableImpl (this=0x7ffdd793fc40, __in_chrg=<optimized out>)
at torch/csrc/autograd/variable.cpp:38
#7 0x00007fffecf9ad21 in at::TensorImpl::release (this=<optimized out>)
at /export/home/x/code/pytorch/torch/lib/tmp_install/include/ATen/TensorImpl.h:31
#8 at::detail::TensorBase::~TensorBase (this=<optimized out>, __in_chrg=<optimized out>)
at /export/home/x/code/pytorch/torch/lib/tmp_install/include/ATen/TensorBase.h:27
#9 at::Tensor::~Tensor (this=<optimized out>, __in_chrg=<optimized out>)
at /export/home/x/code/pytorch/torch/lib/tmp_install/include/ATen/Tensor.h:33
#10 at::Tensor::reset (this=0x7fffa888c288) at /export/home/x/code/pytorch/torch/lib/tmp_install/include/ATen/Tensor.h:57
#11 THPVariable_clear (self=self@entry=0x7fffa888c278) at torch/csrc/autograd/python_variable.cpp:131
#12 0x00007fffecf9ae31 in THPVariable_dealloc (self=0x7fffa888c278) at torch/csrc/autograd/python_variable.cpp:138
#13 0x00007ffff79a75f9 in subtype_dealloc (self=0x7fffa888c278) at Objects/typeobject.c:1222
#14 0x00007ffff79a5f3e in tupledealloc (op=0x7fffa8851a08) at Objects/tupleobject.c:243
#15 0x00007fffecf9072f in THPPointer<_object>::~THPPointer (this=0x7fff4ce00850, __in_chrg=<optimized out>)
at /export/home/x/code/pytorch/torch/csrc/utils/object_ptr.h:12
#16 torch::autograd::PyFunction::apply (this=0x7fffa877e6d0, inputs=...) at torch/csrc/autograd/python_function.cpp:123
#17 0x00007fffecf7bbf4 in torch::autograd::Function::operator() (inputs=..., this=<optimized out>)
at /export/home/x/code/pytorch/torch/csrc/autograd/function.h:89
#18 torch::autograd::call_function (task=...) at torch/csrc/autograd/engine.cpp:208
#19 torch::autograd::Engine::evaluate_function (this=this@entry=0x7fffee198ca0 <engine>, task=...)
at torch/csrc/autograd/engine.cpp:220
#20 0x00007fffecf7ddae in torch::autograd::Engine::thread_main (this=0x7fffee198ca0 <engine>, graph_task=0x0)
at torch/csrc/autograd/engine.cpp:144
#21 0x00007fffecf7ab42 in torch::autograd::Engine::thread_init (this=this@entry=0x7fffee198ca0 <engine>, device=device@entry=-1)
at torch/csrc/autograd/engine.cpp:121
#22 0x00007fffecf9da9a in torch::autograd::python::PythonEngine::thread_init (this=0x7fffee198ca0 <engine>, device=-1)
at torch/csrc/autograd/python_engine.cpp:28
#23 0x00007fffcd559c5c in std::execute_native_thread_routine_compat (__p=<optimized out>)
at /home/msarahan/miniconda2/conda-bld/compilers_linux-64_1507259624353/work/.build/src/gcc-7.2.0/libstdc++-v3/src/c++11/thread.cc:110
#24 0x00007ffff76ba6ba in start_thread (arg=0x7fff4ce01700) at pthread_create.c:333
#25 0x00007ffff6ad83dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
It seems that a piece of memory is being freed incorrectly.
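A SIGBUS inside malloc_consolidate during a free usually means the allocator's bookkeeping got trampled earlier (for example by an out-of-bounds write), so the crash site is probably not the culprit. One thing I plan to do is validate every index tensor right before it is used for gather/index_select/advanced indexing, since a single out-of-range index can silently corrupt memory on both the CPU and GPU paths. A hypothetical helper (check_index and its argument names are mine, not from the real code):

import torch

def check_index(idx, dim_size, name='index'):
    # Validate a LongTensor of indices before it is used for indexing;
    # an out-of-range index is a common way to corrupt memory.
    if idx.numel() == 0:
        return
    lo, hi = idx.min().item(), idx.max().item()
    assert 0 <= lo and hi < dim_size, \
        '%s out of range: [%d, %d] vs size %d' % (name, lo, hi, dim_size)

I'll also try running the script under cuda-memcheck to see whether it reports the original out-of-bounds access.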