CUDA error: an illegal instruction was encountered

A distributed training job crashes with the errors below. Normally it runs fine, but occasionally it crashes with these errors.

Any ideas on how to resolve it?

PyTorch 1.5; sync-bn (SyncBatchNorm) is used, and each GPU's input has different dimensions.

```
  File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/_initialize.py", line 197, in new_fwd
    **applier(kwargs, input_caster))
  File "/tmp/code/quickdetection/src/FCOS/fcos_core/modeling/detector/generalized_rcnn.py", line 49, in forward
    features = self.backbone(images.tensors)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/tmp/code/quickdetection/src/qd/layers/efficient_det.py", line 1221, in forward
    _, p3, p4, p5 = self.backbone_net(inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/tmp/code/quickdetection/src/qd/layers/efficient_det.py", line 1067, in forward
    x = self.model._bn0(x)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 472, in forward
    self.eps, exponential_average_factor, process_group, world_size)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/_functions.py", line 46, in forward
    count_all.view(-1).long().tolist()
RuntimeError: CUDA error: an illegal instruction was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal instruction was encountered (insert_events at /opt/conda/conda-bld/pytorch_1591914742272/work/c10/cuda/CUDACachingAllocator.cpp:771)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7fe430441b5e in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x6d0 (0x7fe430686e30 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fe43042f6ed in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x51e58a (0x7fe45da9358a in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #31: __libc_start_main + 0xf0 (0x7fe4783c6830 in /lib/x86_64-linux-gnu/libc.so.6)
```
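
For context, the setup described above roughly corresponds to the sketch below: a model converted with SyncBatchNorm running under DistributedDataParallel, with a different input resolution on each rank. The module, sizes, and launch method are illustrative assumptions, not the actual training code.

```python
# Minimal sketch of the reported setup (illustrative, not the actual code):
# SyncBatchNorm under DDP with per-rank input sizes. Launch with e.g.
#   torchrun --nproc_per_node=2 repro.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn

def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1),
        nn.BatchNorm2d(16),
        nn.ReLU(),
    ).cuda()
    # Replace every BatchNorm layer with SyncBatchNorm, as in the report
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

    # Each rank feeds a different spatial size ("each GPU's input has different dimensions")
    size = 64 + 32 * local_rank
    x = torch.randn(2, 3, size, size, device="cuda")
    model(x).mean().backward()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```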

Could you update to PyTorch 1.5.1, as 1.5.0 had a bug where internal assert statements were ignored?
This should hopefully yield a better error message than the generic illegal instruction error.
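
As a quick sanity check after upgrading, the active build can be confirmed with a couple of standard calls (a generic check, not specific to this setup):

```python
import torch

# Confirm the PyTorch build and the CUDA / cuDNN versions it was compiled against
print(torch.__version__)               # should report 1.5.1 after the upgrade
print(torch.version.cuda)              # CUDA toolkit version of the installed binary
print(torch.backends.cudnn.version())  # cuDNN version shipped with the binary
```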

With PyTorch 1.6 / CUDA 10.2 / cuDNN 7, I got the following error occasionally:

```
Traceback (most recent call last):
  File "train.py", line 212, in <module>
    train(None)
  File "/gemfield/hostpv/gemfield/deepvac/lib/syszux_deepvac.py", line 335, in __call__
    self.process()
  File "train.py", line 163, in process
    self.processTrain()
  File "/gemfield/hostpv/gemfield/deepvac/lib/syszux_deepvac.py", line 294, in processTrain
    self.doBackward()
  File "train.py", line 139, in doBackward
    self.loss.backward()
  File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1595629403081/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fb3e291677d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xb5d (0x7fb3e2b66d9d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fb3e2902b1d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x53f0ea (0x7fb41c1990ea in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #17: __libc_start_main + 0xe7 (0x7fb442bdfb97 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
```

I don't know whether it is a hardware issue, a driver issue, or a PyTorch issue.

Could you post a minimal code snippet to reproduce this issue as well as your currently installed NVIDIA driver and the GPU you are using?
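
For reference, the requested environment details can be gathered either with `python -m torch.utils.collect_env` or with a small snippet like this (the driver version comes from nvidia-smi, since PyTorch itself only reports the toolkit it was built with):

```python
import subprocess
import torch

# PyTorch build and the CUDA toolkit it was compiled against
print(torch.__version__, torch.version.cuda)

# Name of each visible GPU
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))

# nvidia-smi reports the installed driver version
print(subprocess.check_output(["nvidia-smi"]).decode())
```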

I got the same error with PyTorch 1.6 and CUDA 10.2 on Ubuntu 18.04.


Could you post a minimal code snippet as given in my previous post, so that we could have a look at this issue?

I also encountered similar issues with PyTorch 1.6 on Ubuntu 18 and Ubuntu 16, with CUDA 10.1 and CUDA 10.2. It works fine with PyTorch 1.5.1, but the issue occurs occasionally with PyTorch 1.6.

Could you rerun the code with `CUDA_LAUNCH_BLOCKING=1 python script.py args` and post the stack trace here?
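
The variable has to be visible before the first CUDA call; besides prefixing the command as above, it can also be set at the top of the script, for example (a sketch, assuming nothing has touched the GPU yet):

```python
import os

# Make kernel launches synchronous so the error surfaces at the offending line.
# Must be set before the CUDA context is created, i.e. before any tensor touches the GPU.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the variable is set
```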

Thanks for your reply. However, this is random; it happens roughly 10% of the time. Recently I found that PyTorch 1.5.1 also has this issue. Note that in the following trace, CUDA_LAUNCH_BLOCKING is not set to 1. I'm pasting it here in the hope that it contains some useful information.

Another case. It seems the error message is also random.

Could you post the stack traces by wrapping them into three backticks ``` please?
If you don’t set CUDA_LAUNCH_BLOCKING=1, the stack trace might point to random lines of code.

Hi, I got the same issue. My PyTorch version is 2.0.0, my CUDA version is 12.2, and the issue happens randomly. Could anyone help me? Any ideas?
The message is below:

```
RuntimeError                              Traceback (most recent call last)
Cell In[14], line 1
----> 1 model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval()

File ~\AppData\Roaming\Python\Python310\site-packages\transformers\models\auto\auto_factory.py:488, in _BaseAutoModelClass.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    486 else:
    487     cls.register(config.__class__, model_class, exist_ok=True)
--> 488 return model_class.from_pretrained(
    489     pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
    490 )
    491 elif type(config) in cls._model_mapping.keys():
    492     model_class = _get_model_class(config, cls._model_mapping)

File ~\AppData\Roaming\Python\Python310\site-packages\transformers\modeling_utils.py:2824, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
   2819 logger.warn(
   2820     "This model has some weights that should be kept in higher precision, you need to upgrade "
   2821     "accelerate to properly deal with them (pip install --upgrade accelerate)."
   2822 )
   2823 if device_map != "sequential":
--> 2824 max_memory = get_balanced_memory(
   2825     model,
   2826     dtype=target_dtype,
   2827     low_zero=(device_map == "balanced_low_0"),
   2828     max_memory=max_memory,
   2829     **kwargs,
   2830 )
   2831 kwargs["max_memory"] = max_memory
   2832 # Make sure tied weights are tied before creating the device map.

File ~\AppData\Roaming\Python\Python310\site-packages\accelerate\utils\modeling.py:731, in get_balanced_memory(model, max_memory, no_split_module_classes, dtype, special_dtypes, low_zero)
    703 """
    704 Compute a max_memory dictionary for [infer_auto_device_map] that will balance the use of each available GPU.
    705
    (...)
    728 Transformers generate function).
    729 """
    730 # Get default / clean up max_memory
--> 731 max_memory = get_max_memory(max_memory)
    733 if not (torch.cuda.is_available() or is_xpu_available()) or is_mps_available():
    734     return max_memory

File ~\AppData\Roaming\Python\Python310\site-packages\accelerate\utils\modeling.py:624, in get_max_memory(max_memory)
    622 if not is_xpu_available():
    623     for i in range(torch.cuda.device_count()):
--> 624 _ = torch.tensor([0], device=i)
    625 max_memory = {i: torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())}
    626 else:

RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```