CUDA error: an illegal instruction was encountered

A distributed training job crashes with the errors below. Normally it runs fine, but occasionally it crashes with these errors.

Any ideas on how to resolve it?

PyTorch 1.5; sync-bn (SyncBatchNorm) is used, and each GPU's input has different dimensions.

```
  File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/_initialize.py", line 197, in new_fwd
    **applier(kwargs, input_caster))
  File "/tmp/code/quickdetection/src/FCOS/fcos_core/modeling/detector/generalized_rcnn.py", line 49, in forward
    features = self.backbone(images.tensors)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/tmp/code/quickdetection/src/qd/layers/efficient_det.py", line 1221, in forward
    _, p3, p4, p5 = self.backbone_net(inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/tmp/code/quickdetection/src/qd/layers/efficient_det.py", line 1067, in forward
    x = self.model._bn0(x)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 472, in forward
    self.eps, exponential_average_factor, process_group, world_size)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/_functions.py", line 46, in forward
    count_all.view(-1).long().tolist()
RuntimeError: CUDA error: an illegal instruction was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal instruction was encountered (insert_events at /opt/conda/conda-bld/pytorch_1591914742272/work/c10/cuda/CUDACachingAllocator.cpp:771)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7fe430441b5e in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x6d0 (0x7fe430686e30 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fe43042f6ed in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x51e58a (0x7fe45da9358a in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #31: __libc_start_main + 0xf0 (0x7fe4783c6830 in /lib/x86_64-linux-gnu/libc.so.6)
```
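
For context, the setup described above roughly corresponds to the sketch below: a model converted with SyncBatchNorm running under DistributedDataParallel, with a different input resolution on each rank. The module, sizes, and launch method are illustrative assumptions, not the actual training code.

```python
# Minimal sketch of the reported setup (illustrative, not the actual code):
# SyncBatchNorm under DDP with per-rank input sizes. Launch with e.g.
#   torchrun --nproc_per_node=2 repro.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn

def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1),
        nn.BatchNorm2d(16),
        nn.ReLU(),
    ).cuda()
    # Replace every BatchNorm layer with SyncBatchNorm, as in the report
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

    # Each rank feeds a different spatial size ("each GPU's input has different dimensions")
    size = 64 + 32 * local_rank
    x = torch.randn(2, 3, size, size, device="cuda")
    model(x).mean().backward()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```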

Could you update to PyTorch 1.5.1, as 1.5.0 had a bug where internal assert statements were ignored?
This should hopefully yield a better error message than the generic illegal instruction error.
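
As a quick sanity check after upgrading, the active build can be confirmed with a couple of standard calls (a generic check, not specific to this setup):

```python
import torch

# Confirm the PyTorch build and the CUDA / cuDNN versions it was compiled against
print(torch.__version__)               # should report 1.5.1 after the upgrade
print(torch.version.cuda)              # CUDA toolkit version of the installed binary
print(torch.backends.cudnn.version())  # cuDNN version shipped with the binary
```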

With PyTorch 1.6 / CUDA 10.2 / cuDNN 7, I got the following error occasionally:

```
Traceback (most recent call last):
  File "train.py", line 212, in <module>
    train(None)
  File "/gemfield/hostpv/gemfield/deepvac/lib/syszux_deepvac.py", line 335, in __call__
    self.process()
  File "train.py", line 163, in process
    self.processTrain()
  File "/gemfield/hostpv/gemfield/deepvac/lib/syszux_deepvac.py", line 294, in processTrain
    self.doBackward()
  File "train.py", line 139, in doBackward
    self.loss.backward()
  File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1595629403081/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fb3e291677d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xb5d (0x7fb3e2b66d9d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fb3e2902b1d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x53f0ea (0x7fb41c1990ea in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #17: __libc_start_main + 0xe7 (0x7fb442bdfb97 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
```

I don't know whether it is a hardware issue, a driver issue, or a PyTorch issue.

Could you post a minimal code snippet to reproduce this issue as well as your currently installed NVIDIA driver and the GPU you are using?
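
For reference, the requested environment details can be gathered either with `python -m torch.utils.collect_env` or with a small snippet like this (the driver version comes from nvidia-smi, since PyTorch itself only reports the toolkit it was built with):

```python
import subprocess
import torch

# PyTorch build and the CUDA toolkit it was compiled against
print(torch.__version__, torch.version.cuda)

# Name of each visible GPU
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))

# nvidia-smi reports the installed driver version
print(subprocess.check_output(["nvidia-smi"]).decode())
```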

I got the same error with PyTorch 1.6 and CUDA 10.2 on Ubuntu 18.04.


Could you post a minimal code snippet as given in my previous post, so that we could have a look at this issue?

I also encountered similar issues with PyTorch 1.6 on Ubuntu 18 and Ubuntu 16, with CUDA 10.1 and CUDA 10.2. It works fine with PyTorch 1.5.1, but the issue occurs occasionally with PyTorch 1.6.

Could you rerun the code with `CUDA_LAUNCH_BLOCKING=1 python script.py args` and post the stack trace here?
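
The variable has to be visible before the first CUDA call; besides prefixing the command as above, it can also be set at the top of the script, for example (a sketch, assuming nothing has touched the GPU yet):

```python
import os

# Make kernel launches synchronous so the error surfaces at the offending line.
# Must be set before the CUDA context is created, i.e. before any tensor touches the GPU.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the variable is set
```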

Thanks for your reply. However, this is random; it happens roughly 10% of the time. Recently I found that PyTorch 1.5.1 also has this issue. Note that in the following trace, CUDA_LAUNCH_BLOCKING is not set to 1. I'm pasting it here in the hope that it contains some useful information.

Another case. It seems the error message is also random.

Could you post the stack traces by wrapping them into three backticks ``` please?
If you don’t set CUDA_LAUNCH_BLOCKING=1, the stack trace might point to random lines of code.

Hi, I got the same issue. My PyTorch version is 2.0.0, my CUDA version is 12.2, and the issue happens randomly. Could anyone help me? Any ideas?
The message is below:

```
RuntimeError                              Traceback (most recent call last)
Cell In[14], line 1
----> 1 model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval()

File ~\AppData\Roaming\Python\Python310\site-packages\transformers\models\auto\auto_factory.py:488, in _BaseAutoModelClass.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    486 else:
    487     cls.register(config.__class__, model_class, exist_ok=True)
--> 488 return model_class.from_pretrained(
    489     pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
    490 )
    491 elif type(config) in cls._model_mapping.keys():
    492     model_class = _get_model_class(config, cls._model_mapping)

File ~\AppData\Roaming\Python\Python310\site-packages\transformers\modeling_utils.py:2824, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
   2819 logger.warn(
   2820     "This model has some weights that should be kept in higher precision, you need to upgrade "
   2821     "accelerate to properly deal with them (pip install --upgrade accelerate)."
   2822 )
   2823 if device_map != "sequential":
--> 2824 max_memory = get_balanced_memory(
   2825     model,
   2826     dtype=target_dtype,
   2827     low_zero=(device_map == "balanced_low_0"),
   2828     max_memory=max_memory,
   2829     **kwargs,
   2830 )
   2831 kwargs["max_memory"] = max_memory
   2832 # Make sure tied weights are tied before creating the device map.

File ~\AppData\Roaming\Python\Python310\site-packages\accelerate\utils\modeling.py:731, in get_balanced_memory(model, max_memory, no_split_module_classes, dtype, special_dtypes, low_zero)
    703 """
    704 Compute a max_memory dictionary for [infer_auto_device_map] that will balance the use of each available GPU.
    705
    (...)
    728 Transformers generate function).
    729 """
    730 # Get default / clean up max_memory
--> 731 max_memory = get_max_memory(max_memory)
    733 if not (torch.cuda.is_available() or is_xpu_available()) or is_mps_available():
    734     return max_memory

File ~\AppData\Roaming\Python\Python310\site-packages\accelerate\utils\modeling.py:624, in get_max_memory(max_memory)
    622 if not is_xpu_available():
    623     for i in range(torch.cuda.device_count()):
--> 624 _ = torch.tensor([0], device=i)
    625 max_memory = {i: torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())}
    626 else:

RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```