Error while training: 'RuntimeError: CUDA error: an illegal memory access was encountered'

Hi!

I am trying to train an HRViT model following the instructions here. However, training fails during the backward pass with an illegal memory access error.

Versions:

  • pytorch 1.8.0+cu111
  • torchvision 0.9.0+cu111

Here is the whole traceback:

Traceback (most recent call last):
  File "mmseg_train.py", line 144, in <module>
    train_segmentor(model, dataset, cfg, distributed=False, validate=True, meta=dict())
  File "/workspace/Swin-Transformer-Semantic-Segmentation/mmseg/apis/train.py", line 116, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/iter_based_runner.py", line 131, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/iter_based_runner.py", line 66, in train
    self.call_hook('after_train_iter')
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/hooks/optimizer.py", line 27, in after_train_iter
    runner.outputs['loss'].backward()
  File "/usr/local/lib/python3.8/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f22e36962f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f22e369367b in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7f22e38ee1f9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f22e367e3a4 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0xcb5fe9 (0x7f226e5b3fe9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x2bac153 (0x7f22704aa153 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x32114d2 (0x7f2270b0f4d2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: torch::autograd::deleteNode(torch::autograd::Node*) + 0x7f (0x7f2270b0f57f in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x31f9dc8 (0x7f2270af7dc8 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: c10::TensorImpl::release_resources() + 0x20 (0x7f22e367e370 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x6e473a (0x7f22e45f673a in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x6e47d1 (0x7f22e45f67d1 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #12: python() [0x5a614c]
frame #13: python() [0x5cbec3]
frame #14: python() [0x5d1aec]
frame #15: python() [0x5d1ca7]
frame #16: python() [0x5a605d]
frame #17: python() [0x5ebbd8]
frame #18: python() [0x542918]
frame #19: python() [0x54296a]
<omitting python frames>
frame #25: __libc_start_main + 0xf3 (0x7f22e8ea40b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

Any ideas of what’s going on here? Thank you!

Could you update to the latest stable or nightly release and check if you are still hitting this issue, please?
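It might also help to narrow down where the failure really happens. CUDA kernels are launched asynchronously, so the Python frame shown in the traceback (here the backward() call) is not necessarily the operation that triggered the illegal access. Setting the CUDA_LAUNCH_BLOCKING=1 environment variable forces synchronous launches, which usually makes the reported stack trace point closer to the failing kernel. A minimal sketch of re-running the script that way is below; the helper names blocking_env and rerun_with_blocking are made up for illustration, and mmseg_train.py is just the script name taken from the traceback above.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # CUDA_LAUNCH_BLOCKING=1 makes each kernel launch wait for completion,
    # so errors surface at the call that caused them (at the cost of speed).
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_with_blocking(script, *args):
    """Run a training script in a child process with CUDA_LAUNCH_BLOCKING=1.

    Hypothetical helper: it only wraps subprocess.run and returns the exit code.
    """
    cmd = [sys.executable, script, *args]
    return subprocess.run(cmd, env=blocking_env()).returncode


# Example call (not executed here):
#   rerun_with_blocking("mmseg_train.py")
```

The same effect can be had from the shell with CUDA_LAUNCH_BLOCKING=1 python mmseg_train.py. Training will be noticeably slower in this mode, so it is only meant for debugging.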

When I use a newer version of PyTorch, I get the following error:

AttributeError: EncoderDecoder: HRViT: 'super' object has no attribute '_specify_ddp_gpu_num'

I believe one of the libraries this model is built on (mmcv) relies on an older version of PyTorch.
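To make a version mismatch like this easier to check, it may be useful to list the installed versions of the packages involved. The snippet below is a rough sketch using only the standard library. The distribution names passed in (torch, torchvision, mmcv, mmcv-full) are an assumption about how the packages were installed, and installed_versions is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed under this distribution name.
            versions[name] = None
    return versions


# Assumed distribution names; adjust them to match the actual environment.
print(installed_versions(["torch", "torchvision", "mmcv", "mmcv-full"]))
```

The printed versions can then be compared against the compatibility table in the mmcv installation notes for the release in use.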

I have the same problem with PyTorch 1.8.1+cu102 and 1.8.0+cu102 while running the 2DPASS model.
Have you solved this error already?