Hi!
I am trying to train an HRViT model following the instructions here. However, training crashes during the backward pass with a CUDA "illegal memory access" error.
Versions:
- pytorch 1.8.0+cu111
- torchvision 0.9.0+cu111
Here is the whole traceback:
Traceback (most recent call last):
  File "mmseg_train.py", line 144, in <module>
    train_segmentor(model, dataset, cfg, distributed=False, validate=True, meta=dict())
  File "/workspace/Swin-Transformer-Semantic-Segmentation/mmseg/apis/train.py", line 116, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/iter_based_runner.py", line 131, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/iter_based_runner.py", line 66, in train
    self.call_hook('after_train_iter')
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/hooks/optimizer.py", line 27, in after_train_iter
    runner.outputs['loss'].backward()
  File "/usr/local/lib/python3.8/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f22e36962f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f22e369367b in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7f22e38ee1f9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f22e367e3a4 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0xcb5fe9 (0x7f226e5b3fe9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x2bac153 (0x7f22704aa153 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x32114d2 (0x7f2270b0f4d2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: torch::autograd::deleteNode(torch::autograd::Node*) + 0x7f (0x7f2270b0f57f in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x31f9dc8 (0x7f2270af7dc8 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: c10::TensorImpl::release_resources() + 0x20 (0x7f22e367e370 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x6e473a (0x7f22e45f673a in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x6e47d1 (0x7f22e45f67d1 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #12: python() [0x5a614c]
frame #13: python() [0x5cbec3]
frame #14: python() [0x5d1aec]
frame #15: python() [0x5d1ca7]
frame #16: python() [0x5a605d]
frame #17: python() [0x5ebbd8]
frame #18: python() [0x542918]
frame #19: python() [0x54296a]
<omitting python frames>
frame #25: __libc_start_main + 0xf3 (0x7f22e8ea40b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
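CUDA errors are reported asynchronously, so the `backward()` frame in the traceback above may not be where the fault actually happens. Below is a minimal sketch of how the run could be repeated with synchronous kernel launches (`CUDA_LAUNCH_BLOCKING=1`) to get a more precise trace. It assumes the script is started as `python mmseg_train.py`; the helper names `blocking_env` and `rerun` are hypothetical and not part of mmseg or mmcv.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # With this set, each CUDA kernel launch blocks until it finishes, so an
    # error is raised at the call that caused it (at the cost of speed).
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun(script="mmseg_train.py", extra_args=()):
    """Re-run the training script (assumed entry point) in the blocking env."""
    cmd = [sys.executable, script, *extra_args]
    return subprocess.run(cmd, env=blocking_env()).returncode


# Example (not executed here): exit_code = rerun(extra_args=["<config args>"])
```

The same effect can be had from the shell by prefixing the usual command with `CUDA_LAUNCH_BLOCKING=1`; the wrapper is only a convenience.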
Any idea what’s going on here? Thank you!