Hi everyone,
I occasionally run into the following error during training:
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/mmcv/runner/hooks/optimizer.py", line 259, in after_train_iter
    scaled_loss.backward()
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:764)
The error occurs at random: it may only appear after several epochs of training, so it is time-consuming to reproduce. I have written the code with strict assertions to keep the variables within reasonable ranges, so there should not be trivial bugs such as out-of-range indices.
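To give an idea of what I mean by assertions, they look roughly like the sketch below (the helper name and arguments are just illustrative, not my actual code); I run such checks on label/index tensors before the loss computation:

import torch

def check_indices(indices, num_classes, name="indices"):
    # Move to CPU first so the check itself cannot be hit by an asynchronous CUDA error
    idx = indices.detach().cpu()
    assert idx.numel() > 0, f"{name} is empty"
    assert idx.min().item() >= 0, f"{name} contains negative values"
    assert idx.max().item() < num_classes, (
        f"{name} max {idx.max().item()} is out of range for num_classes={num_classes}"
    )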
I would like to ask for help with the following:
- I hope someone can help me summarize the possible causes (they need not be very specific) of this error.
- Is it possible that running out of GPU memory causes this error? (A sketch of how I could check this is shown below.)
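Regarding the second question, here is a minimal sketch of how I could log GPU memory around training iterations to see whether usage is close to the device limit when the crash happens (the helper name and print format are just illustrative):

import torch

def log_cuda_memory(tag=""):
    # Compare the peak allocated memory since the last reset against total device memory
    device = torch.cuda.current_device()
    peak_mib = torch.cuda.max_memory_allocated(device) / 1024 ** 2
    total_mib = torch.cuda.get_device_properties(device).total_memory / 1024 ** 2
    print(f"[{tag}] peak allocated: {peak_mib:.0f} MiB / device total: {total_mib:.0f} MiB")
    torch.cuda.reset_max_memory_allocated(device)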
I would appreciate any help you can offer.