CUDA error: an illegal instruction was encountered

My distributed training crashes with the errors below. Normally it works well, but occasionally it fails like this.

Any ideas on how to resolve it?

Setup: PyTorch 1.5, SyncBatchNorm is used, and each GPU's input has different dimensions.

  File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/_initialize.py", line 197, in new_fwd
    **applier(kwargs, input_caster))
  File "/tmp/code/quickdetection/src/FCOS/fcos_core/modeling/detector/generalized_rcnn.py", line 49, in forward
    features = self.backbone(images.tensors)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/tmp/code/quickdetection/src/qd/layers/efficient_det.py", line 1221, in forward
    _, p3, p4, p5 = self.backbone_net(inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/tmp/code/quickdetection/src/qd/layers/efficient_det.py", line 1067, in forward
    x = self.model._bn0(x)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 472, in forward
    self.eps, exponential_average_factor, process_group, world_size)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/_functions.py", line 46, in forward
    count_all.view(-1).long().tolist()
RuntimeError: CUDA error: an illegal instruction was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal instruction was encountered (insert_events at /opt/conda/conda-bld/pytorch_1591914742272/work/c10/cuda/CUDACachingAllocator.cpp:771)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7fe430441b5e in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x6d0 (0x7fe430686e30 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fe43042f6ed in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x51e58a (0x7fe45da9358a in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #31: __libc_start_main + 0xf0 (0x7fe4783c6830 in /lib/x86_64-linux-gnu/libc.so.6)

Could you update to PyTorch 1.5.1? 1.5.0 had a bug where internal assert statements were ignored.
This should hopefully yield a better error message than the current illegal instruction error.
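If you want to verify the upgrade programmatically, a small version guard could look like the sketch below; the `version_at_least` helper is my own illustration, not a PyTorch API:

```python
# Sketch of a version guard (the helper name is mine, not a PyTorch API):
# refuse to run on PyTorch < 1.5.1, where internal CUDA assert
# statements were silently ignored.
def version_at_least(installed: str, required: str) -> bool:
    """Compare dotted version strings numerically; local build suffixes
    such as '+cu102' are stripped before comparison."""
    def parse(v: str):
        return tuple(int(p) for p in v.split("+")[0].split(".") if p.isdigit())
    return parse(installed) >= parse(required)

# Usage, assuming torch is importable:
# import torch
# assert version_at_least(torch.__version__, "1.5.1"), (
#     "please upgrade: pip install torch==1.5.1")
```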

With PyTorch 1.6 / CUDA 10.2 / cuDNN 7, I occasionally get the following error:

Traceback (most recent call last):
  File "train.py", line 212, in <module>
    train(None)
  File "/gemfield/hostpv/gemfield/deepvac/lib/syszux_deepvac.py", line 335, in __call__
    self.process()
  File "train.py", line 163, in process
    self.processTrain()
  File "/gemfield/hostpv/gemfield/deepvac/lib/syszux_deepvac.py", line 294, in processTrain
    self.doBackward()
  File "train.py", line 139, in doBackward
    self.loss.backward()
  File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1595629403081/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fb3e291677d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xb5d (0x7fb3e2b66d9d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fb3e2902b1d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x53f0ea (0x7fb41c1990ea in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #17: __libc_start_main + 0xe7 (0x7fb442bdfb97 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

I don't know whether it is a hardware issue, a driver issue, or a PyTorch issue.

Could you post a minimal code snippet to reproduce this issue as well as your currently installed NVIDIA driver and the GPU you are using?

I got the same error with PyTorch 1.6 and CUDA 10.2 on Ubuntu 18.04.


Could you post a minimal code snippet, as mentioned in my previous post, so that we can have a look at this issue?

I also encountered similar issues with PyTorch 1.6 on Ubuntu 18.04 and 16.04, with CUDA 10.1 and CUDA 10.2. Everything works fine with PyTorch 1.5.1, but the issue occurs occasionally with PyTorch 1.6.

Could you rerun the code with `CUDA_LAUNCH_BLOCKING=1 python script.py args` and post the stack trace here?
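For reference, `CUDA_LAUNCH_BLOCKING` has to be set before CUDA is initialized; a minimal sketch of the two common ways to set it (the in-Python variant must run before `import torch`):

```python
import os

# CUDA_LAUNCH_BLOCKING forces synchronous kernel launches, so the
# reported stack trace points at the kernel that actually failed.
# It must be set before CUDA is initialized, i.e. before importing torch:
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # only import torch after the assignment above

# Equivalent shell invocation:
#   CUDA_LAUNCH_BLOCKING=1 python script.py args
```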

Thanks for your reply. However, this is random: it happens roughly 10% of the time. Recently I found that PyTorch 1.5.1 also has this issue. Note that in the following trace, CUDA_LAUNCH_BLOCKING was not set to 1. I am pasting it here in the hope that it contains some useful information.

Here is another case. It seems like the error message is also random.

Could you post the stack traces wrapped in three backticks (```), please?
If you don't set CUDA_LAUNCH_BLOCKING=1, the stack trace might point to random lines of code.