Error while training: 'RuntimeError: CUDA error: an illegal memory access was encountered'

bbacken · July 19, 2022, 7:01pm

Hi!

I am trying to train an HRViT model following the instructions here. However, it is throwing an error about illegal memory access.

Versions:

pytorch 1.8.0+cu111
torchvision 0.9.0+cu111

Here is the whole traceback:

Traceback (most recent call last):
  File "mmseg_train.py", line 144, in <module>
    train_segmentor(model, dataset, cfg, distributed=False, validate=True, meta=dict())
  File "/workspace/Swin-Transformer-Semantic-Segmentation/mmseg/apis/train.py", line 116, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/iter_based_runner.py", line 131, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/iter_based_runner.py", line 66, in train
    self.call_hook('after_train_iter')
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/hooks/optimizer.py", line 27, in after_train_iter
    runner.outputs['loss'].backward()
  File "/usr/local/lib/python3.8/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f22e36962f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f22e369367b in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7f22e38ee1f9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f22e367e3a4 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0xcb5fe9 (0x7f226e5b3fe9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x2bac153 (0x7f22704aa153 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x32114d2 (0x7f2270b0f4d2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: torch::autograd::deleteNode(torch::autograd::Node*) + 0x7f (0x7f2270b0f57f in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x31f9dc8 (0x7f2270af7dc8 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: c10::TensorImpl::release_resources() + 0x20 (0x7f22e367e370 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x6e473a (0x7f22e45f673a in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x6e47d1 (0x7f22e45f67d1 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #12: python() [0x5a614c]
frame #13: python() [0x5cbec3]
frame #14: python() [0x5d1aec]
frame #15: python() [0x5d1ca7]
frame #16: python() [0x5a605d]
frame #17: python() [0x5ebbd8]
frame #18: python() [0x542918]
frame #19: python() [0x54296a]
<omitting python frames>
frame #25: __libc_start_main + 0xf3 (0x7f22e8ea40b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

Any ideas of what’s going on here? Thank you!
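
Because CUDA kernels run asynchronously, the Python frame shown in a traceback like the one above (here `loss.backward()`) is not necessarily where the faulty kernel was launched. A common first debugging step is to rerun with the `CUDA_LAUNCH_BLOCKING=1` environment variable so that the error surfaces at the call that caused it. The following is only a minimal sketch: the helper names `blocking_env` and `rerun_blocking` are hypothetical (not part of PyTorch, mmcv, or mmseg), and it assumes the script is started as `python mmseg_train.py`.

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the
    # Python traceback should point at the operation that actually failed.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_blocking(script="mmseg_train.py", args=()):
    """Rerun the training script in a child process and return its exit code.

    The default script name is taken from the traceback; adjust as needed.
    """
    cmd = [sys.executable, script, *args]
    return subprocess.run(cmd, env=blocking_env()).returncode


# Example (not executed here): exit_code = rerun_blocking()
```

Running this way is slower, so it is meant for locating the failing operation rather than for regular training. An out-of-range class index in the segmentation labels is one frequent cause of this error, but nothing in the traceback confirms that this is the case here.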

ptrblck · July 19, 2022, 8:51pm

Could you update to the latest stable or nightly release and check if you are still hitting this issue, please?

bbacken · July 19, 2022, 9:07pm

When I use a newer version of PyTorch, I get the following error:

AttributeError: EncoderDecoder: HRViT: 'super' object has no attribute '_specify_ddp_gpu_num'

I believe one of the libraries this model is built on (mmcv) relies on an older version of PyTorch.
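
When a version mismatch between PyTorch and mmcv is suspected, it can help to list exactly which versions are installed before comparing them against the compatibility notes of the mmcv release in use. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names `mmcv-full` and `mmcv` are assumptions about how mmcv was installed in this environment.

```python
from importlib import metadata


def installed_versions(packages=("torch", "torchvision", "mmcv-full", "mmcv")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed under this name.
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Including this output in a report makes it easier for others to tell whether the installed mmcv build matches the PyTorch and CUDA versions.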

m110 · April 20, 2023, 1:41pm

I have the same problem with PyTorch 1.8.1+cu102 and 1.8.0+cu102 while running the 2DPASS model.
Have you solved this error already?