Distributed training sometimes crashes with the errors below; normally it works well. Any ideas on how to resolve it?
Setup: PyTorch 1.5, sync-bn is used, and each GPU's input has different dimensions.
File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/_initialize.py", line 197, in new_fwd
**applier(kwargs, input_caster))
File "/tmp/code/quickdetection/src/FCOS/fcos_core/modeling/detector/generalized_rcnn.py", line 49, in forward
features = self.backbone(images.tensors)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/tmp/code/quickdetection/src/qd/layers/efficient_det.py", line 1221, in forward
_, p3, p4, p5 = self.backbone_net(inputs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/tmp/code/quickdetection/src/qd/layers/efficient_det.py", line 1067, in forward
x = self.model._bn0(x)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 472, in forward
self.eps, exponential_average_factor, process_group, world_size)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/_functions.py", line 46, in forward
count_all.view(-1).long().tolist()
RuntimeError: CUDA error: an illegal instruction was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal instruction was encountered (insert_events at /opt/conda/conda-bld/pytorch_1591914742272/work/c10/cuda/CUDACachingAllocator.cpp:771)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7fe430441b5e in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x6d0 (0x7fe430686e30 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fe43042f6ed in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x51e58a (0x7fe45da9358a in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #31: __libc_start_main + 0xf0 (0x7fe4783c6830 in /lib/x86_64-linux-gnu/libc.so.6)
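For context, sync-bn is enabled with the native torch.nn.SyncBatchNorm before wrapping the model in DistributedDataParallel. A minimal sketch of that setup (the actual FCOS/EfficientDet detector and the apex amp initialization are omitted; the tiny Sequential model and the `build_ddp_model` helper here are just placeholders):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def build_ddp_model(rank: int) -> DDP:
    # Assumes torch.distributed.init_process_group(...) has already been called
    # by the launcher; the placeholder Sequential stands in for the real detector.
    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
        torch.nn.BatchNorm2d(8),
        torch.nn.ReLU(),
    ).cuda(rank)

    # Convert every BatchNorm layer to SyncBatchNorm so batch statistics are
    # reduced across all ranks in the default process group during forward.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

    # Wrap in DDP; each rank then feeds inputs whose spatial sizes can differ.
    return DDP(model, device_ids=[rank])
```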
Could you update to PyTorch 1.5.1, as 1.5.0 had a bug where internal assert statements were ignored?
This should hopefully yield a better error message than the illegal memory access.
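After upgrading, you could quickly confirm which build is actually picked up in the training environment, e.g.:

```python
import torch

# Confirm that the environment really runs the upgraded 1.5.1 build
# and print the CUDA / cuDNN versions it was compiled against.
print(torch.__version__)
print(torch.version.cuda)
print(torch.backends.cudnn.version())
```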
With PyTorch 1.6 / CUDA 10.2 / cuDNN 7, I got the following error occasionally:
Traceback (most recent call last):
File "train.py", line 212, in <module>
train(None)
File "/gemfield/hostpv/gemfield/deepvac/lib/syszux_deepvac.py", line 335, in __call__
self.process()
File "train.py", line 163, in process
self.processTrain()
File "/gemfield/hostpv/gemfield/deepvac/lib/syszux_deepvac.py", line 294, in processTrain
self.doBackward()
File "train.py", line 139, in doBackward
self.loss.backward()
File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1595629403081/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fb3e291677d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xb5d (0x7fb3e2b66d9d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fb3e2902b1d in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x53f0ea (0x7fb41c1990ea in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #17: __libc_start_main + 0xe7 (0x7fb442bdfb97 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
I don't know whether it is a hardware issue, a driver issue, or a PyTorch issue?
I also encountered similar issues with PyTorch 1.6 on Ubuntu 18 and Ubuntu 16, with CUDA 10.1 or CUDA 10.2. It works fine with PyTorch 1.5.1, but the issue occurs occasionally with PyTorch 1.6.
Thanks for your reply. However, this is random; it happens roughly 10% of the time. Recently I found that PyTorch 1.5.1 also has this issue. Note that in the following trace, CUDA_LAUNCH_BLOCKING is not set to 1. I'm pasting it here and hopefully it contains some useful information.
Could you post the stack traces by wrapping them into three backticks ``` please?
If you don’t set CUDA_LAUNCH_BLOCKING=1, the stack trace might point to random lines of code.
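A minimal sketch of setting it from Python, assuming it happens before CUDA is initialized (exporting it in the shell before launching the script works just as well):

```python
import os

# Must be set before PyTorch initializes CUDA: kernel launches then run
# synchronously, so the Python stack trace points at the op that really failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import only after the environment variable is set
```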
Hi, I got the same issue. My PyTorch version is 2.0.0, my CUDA version is 12.2, and the issue happens randomly. Could anyone help me? Any ideas?
The message is below:
RuntimeError Traceback (most recent call last)
Cell In[14], line 1
----> 1 model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval()
File ~\AppData\Roaming\Python\Python310\site-packages\transformers\modeling_utils.py:2824, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
2819 logger.warn(
2820 "This model has some weights that should be kept in higher precision, you need to upgrade "
2821 "accelerate to properly deal with them (pip install --upgrade accelerate)."
2822 )
2823 if device_map != "sequential":
-> 2824 max_memory = get_balanced_memory(
2825 model,
2826 dtype=target_dtype,
2827 low_zero=(device_map == "balanced_low_0"),
2828 max_memory=max_memory,
2829 **kwargs,
2830 )
2831 kwargs["max_memory"] = max_memory
2832 # Make sure tied weights are tied before creating the device map.
File ~\AppData\Roaming\Python\Python310\site-packages\accelerate\utils\modeling.py:731, in get_balanced_memory(model, max_memory, no_split_module_classes, dtype, special_dtypes, low_zero)
703 """
704 Compute a max_memory dictionary for [infer_auto_device_map] that will balance the use of each available GPU.
705
(…)
728 Transformers generate function).
729 """
730 # Get default / clean up max_memory
-> 731 max_memory = get_max_memory(max_memory)
733 if not (torch.cuda.is_available() or is_xpu_available()) or is_mps_available():
734 return max_memory
File ~\AppData\Roaming\Python\Python310\site-packages\accelerate\utils\modeling.py:624, in get_max_memory(max_memory)
622 if not is_xpu_available():
623 for i in range(torch.cuda.device_count()):
-> 624 _ = torch.tensor([0], device=i)
625 max_memory = {i: torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())}
626 else:
RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.