RuntimeError: CUDA error: an illegal instruction was encountered

I’m trying to train a model on an EC2 instance with 4 V100 16 GB GPUs, using Fully Sharded Data Parallel (FSDP). At first this was working fine, but after a while I started getting the error below (a trimmed sketch of my training setup follows the traceback):

Batch 1186 | Loss: -52.208 | Time: 0.398

Batch 1187 | Loss: -50.720 | Time: 0.396

Batch 1188 | Loss: -49.351 | Time: 0.396

Batch 1189 | Loss: -46.431 | Time: 0.400

Batch 1190 | Loss: -48.643 | Time: 0.400

Batch 1191 | Loss: -53.842 | Time: 0.401

Batch 1192 | Loss: -55.080 | Time: 0.395

Traceback (most recent call last):
  File "/home/petar/SepFormer-FSDP.py", line 384, in <module>
    mp.spawn(fsdp_main,
  File "/home/petar/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/petar/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
  File "/home/petar/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 163, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/petar/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/home/petar/SepFormer-FSDP.py", line 329, in fsdp_main
    train(args, model, rank, world_size, train_loader, optimizer, epoch, sampler=sampler1)
  File "/home/petar/SepFormer-FSDP.py", line 261, in train
    mixed_signal = items[0].to(rank)
RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
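
For context, here is a trimmed, runnable sketch of my setup. Only the function names, the call signatures, and the two lines that show up in the tracebacks (mixed_signal = items[0].to(rank) and the ddp_loss all_reduce) are taken from my actual script; the model, data, loss, and hyperparameters are placeholders.

import os
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def train(args, model, rank, world_size, train_loader, optimizer, epoch, sampler=None):
    model.train()
    ddp_loss = torch.zeros(1).to(rank)
    for items in train_loader:
        mixed_signal = items[0].to(rank)   # the line where the illegal-instruction error is raised
        optimizer.zero_grad()
        loss = model(mixed_signal).mean()  # placeholder forward/loss; the real model is my SepFormer
        loss.backward()
        optimizer.step()
        ddp_loss[0] += loss.item()
    dist.all_reduce(ddp_loss, op=dist.ReduceOp.SUM)  # the line where the later Socket Timeout is raised

def fsdp_main(rank, world_size, args):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = FSDP(nn.Linear(16000, 256).to(rank))  # placeholder module standing in for the SepFormer
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    dataset = torch.utils.data.TensorDataset(torch.randn(64, 16000))  # placeholder data
    sampler1 = torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    train_loader = torch.utils.data.DataLoader(dataset, batch_size=2, sampler=sampler1)

    for epoch in range(1, 3):
        sampler1.set_epoch(epoch)
        train(args, model, rank, world_size, train_loader, optimizer, epoch, sampler=sampler1)

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()  # 4 on this instance
    mp.spawn(fsdp_main, args=(world_size, None), nprocs=world_size, join=True)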

As the logs show, the model trains normally for a while and then crashes. I’ve seen other questions suggesting that I set CUDA_LAUNCH_BLOCKING=1 in my script; when I do that, training is a bit slower, but it does not crash.
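
For completeness, I set that flag roughly like this, at the very top of the script:

import os

# Make kernel launches synchronous so a failing kernel is reported at its real call site.
# Set before any CUDA context is created; worker processes started with mp.spawn inherit it.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'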

My dependencies:
NVIDIA driver: 525.147.05
torch==2.1.1
torchaudio==2.1.1
torchvision==0.16.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105

CPU info:

Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      46 bits physical, 48 bits virtual
CPU(s):                             32
On-line CPU(s) list:                0-31
Thread(s) per core:                 2
Core(s) per socket:                 16
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              79
Model name:                         Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:                           1
CPU MHz:                            3000.000
CPU max MHz:                        3000.0000
CPU min MHz:                        1200.0000
BogoMIPS:                           4600.01
Hypervisor vendor:                  Xen
Virtualization type:                full
L1d cache:                          512 KiB
L1i cache:                          512 KiB
L2 cache:                           4 MiB
L3 cache:                           45 MiB
NUMA node0 CPU(s):                  0-31
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                 Mitigation; PTE Inversion
Vulnerability Mds:                  Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Vulnerable
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt

Could you run your use case with compute-sanitizer and post the log here, assuming it’s able to detect the memory violation?

When I run my code with compute-sanitizer, the only output in the log file is:

========= COMPUTE-SANITIZER
========= Error: No attachable process found. compute-sanitizer timed-out.
========= Default timeout can be adjusted with --launch-timeout. Awaiting target completion.
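
If it’s relevant: since all the CUDA work happens in the worker processes created by mp.spawn, I assume the sanitizer has to be told to follow child processes as well, i.e. an invocation along the lines of:

compute-sanitizer --target-processes all python SepFormer-FSDP.py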

Another odd thing about this problem is that every now and then I get a different error instead, such as

RuntimeError: CUDA error: an illegal memory access was encountered

or

RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

I don’t understand this behaviour.

Try to create a CUDA coredump on a device exception, as described here, and check the failing kernel with cuda-gdb.

The traceback I get when I set

os.environ['CUDA_ENABLE_COREDUMP_ON_EXCEPTION'] = '1'

is:

Traceback (most recent call last):
  File "/home/petar/SepFormer-FSDP.py", line 409, in <module>
    mp.spawn(fsdp_main,
  File "/home/petar/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/petar/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
  File "/home/petar/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 163, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/petar/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/home/petar/SepFormer-FSDP.py", line 354, in fsdp_main
    train(args, model, rank, world_size, train_loader, optimizer, epoch, sampler=sampler1)
  File "/home/petar/SepFormer-FSDP.py", line 319, in train
    dist.all_reduce(ddp_loss, op=dist.ReduceOp.SUM)
  File "/home/petar/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/petar/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from doWait at ../torch/csrc/distributed/c10d/TCPStore.cpp:445 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f68db12b617 in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7f68db0e6a56 in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x32c (0x7f690c0a630c in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f690c0a7492 in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0x55 (0x7f690c0a78b5 in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f690c05f101 in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f690c05f101 in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f690c05f101 in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xb2 (0x7f68dc121802 in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x203 (0x7f68dc1271d3 in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #10: <unknown function> + 0xf199e7 (0x7f68dc1359e7 in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #11: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x21 (0x7f68dc1376a1 in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #12: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x3a7 (0x7f68dc1392c7 in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #13: <unknown function> + 0x557c8d6 (0x7f690c0538d6 in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x5587a43 (0x7f690c05ea43 in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x5587b69 (0x7f690c05eb69 in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x4bb176b (0x7f690b68876b in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x4baf74c (0x7f690b68674c in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0x1904a88 (0x7f69083dba88 in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0x558d0de (0x7f690c0640de in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #20: <unknown function> + 0x559b4fd (0x7f690c0724fd in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #21: <unknown function> + 0xc43bd5 (0x7f691e659bd5 in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #22: <unknown function> + 0x3eeac4 (0x7f691de04ac4 in /home/petar/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #54: __libc_start_main + 0xf3 (0x7f695c5b7083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #55: _start + 0x2e (0x559a3fdca09e in /home/petar/venv/bin/python3)
. This may indicate a possible application crash on rank 0 or a network set up issue.

/home/petar/.pyenv/versions/3.10.12/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 20 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
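
In case the placement matters: my understanding is that the variable has to be set before any CUDA context is created, so I put it in the parent process, ahead of mp.spawn, and let the spawned workers inherit it. A standalone sketch of that (the worker here is just a stand-in, not my training code):

import os
import torch
import torch.multiprocessing as mp

# Ask the driver to write a GPU coredump when a device-side exception occurs.
# The variable has to be in the environment before the CUDA context is created
# in each worker; processes started by mp.spawn inherit the parent's environment.
os.environ['CUDA_ENABLE_COREDUMP_ON_EXCEPTION'] = '1'

def worker(rank):
    torch.cuda.set_device(rank)
    torch.ones(1, device=rank)  # any CUDA call initializes the context with coredumps enabled

if __name__ == '__main__':
    mp.spawn(worker, nprocs=torch.cuda.device_count(), join=True)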

EDIT: The aforementioned errors just stopped, and I could not reproduce them again. All of a sudden the training went through all of the data, but this time it crashed in the all_reduce stage.