Summarize the reasons for the common error: "Illegal Memory Access"

Hi everyone,
I occasionally run into the following error:

  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/mmcv/runner/hooks/optimizer.py", line 259, in after_train_iter
    scaled_loss.backward()
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:764)

The error occurs at random. It may be raised only after several epochs of training, so reproducing it is time-consuming. My code uses strict assertions to keep the variables within valid ranges, so trivial mistakes such as out-of-range indexing should already be ruled out.

I would like help with the following:

  1. Could someone summarize the possible causes of this error (no need to be very specific)?
  2. Can running out of memory cause this error?

I would appreciate any help.

  1. Any invalid memory read or write can cause this error. The out-of-range indexing you mentioned is a common cause, but the error can also come from other sources, e.g. race conditions.

  2. I don't think it can. At least, this would be the first time I have seen an OOM cause a memory violation.
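
To narrow this down, it may help to remember that CUDA kernels are launched asynchronously, so the Python traceback can point at a later call (here `backward()`) than the operation that actually faulted. Below is a minimal sketch of two debugging aids, not a definitive recipe: forcing synchronous launches via the `CUDA_LAUNCH_BLOCKING` environment variable, and an explicit bounds check before an indexing operation. The helper `check_index_range` and its parameters are hypothetical names used for illustration, and the check assumes indices must lie in `[0, size)`.

```python
import os

# Set this before CUDA is initialized (simplest: before importing torch).
# Kernel launches then run synchronously, so the traceback should point
# at the operation that actually triggered the illegal access.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def check_index_range(indices, size, name="index"):
    """Hypothetical helper: raise IndexError if any value is outside [0, size).

    `indices` is any flat iterable of integers.
    """
    bad = [i for i in indices if not 0 <= i < size]
    if bad:
        raise IndexError(
            f"{name}: {len(bad)} value(s) outside [0, {size}), e.g. {bad[0]}"
        )
    return indices


# Example: validate labels against the number of classes before using them.
labels = check_index_range([0, 2, 4], size=5, name="labels")
```

For a tensor `idx`, one could pass `idx.flatten().tolist()` to the helper, or compare `idx.min()` and `idx.max()` against the valid range directly, which avoids copying every element to the host. Synchronous launches slow training down noticeably, so this mode is best kept for debugging runs only.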