CUDA error: unspecified launch failure and NCCL issues

Distributed training has not been working for several months. The code runs fine for around 1.5 days and then fails with the message below. I use CUDA 12.4 and PyTorch 2.5.1 with Accelerate for multi-GPU training, with the c10d backend and num_workers=0 in the dataloader. (A rough sketch of how the training loop is wired is at the end of this post.)

An error occurred during training: CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "/scratch/skatar/PreTraining/pretrain_iter.py", line 551, in do_pretrain
    out = model.forward(batch["ppg_segments"], apply_mask=True)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1639, in forward
    inputs, kwargs = self._pre_forward(*inputs, **kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1511, in _pre_forward
    self.logger.set_runtime_stats_and_log()
RuntimeError: CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


node00:86079:311684 [1] misc/strongstream.cc:395 NCCL WARN Cuda failure 'unspecified launch failure'
node00:86079:311684 [1] NCCL INFO init.cc:1952 -> 1

node00:86079:311684 [1] init.cc:2083 NCCL WARN commReclaim: comm 0xc1da940 (rank = 1) in abort, error 1
node00:86079:86307 [1] NCCL INFO [Service thread] Connection closed by localRank 1

node00:86079:86307 [1] include/alloc.h:125 NCCL WARN Cuda failure 719 'unspecified launch failure'
node00:86079:86307 [1] NCCL INFO include/alloc.h:246 -> 1

[the two NCCL lines above repeat 12 more times]
node00:86079:86307 [1] NCCL INFO transport/net.cc:541 -> 1
node00:86079:86307 [1] NCCL INFO transport/net.cc:944 -> 1
node00:86079:86307 [1] NCCL INFO proxy.cc:984 -> 1
node00:86079:86307 [1] NCCL INFO proxy.cc:1000 -> 1
node00:86079:311684 [1] include/alloc.h:125 NCCL WARN Cuda failure 719 'unspecified launch failure'
node00:86079:311684 [1] NCCL INFO include/alloc.h:246 -> 1
node00:86079:311684 [1] NCCL INFO transport/p2p.cc:541 -> 1
node00:86079:311684 [1] NCCL INFO channel.cc:158 -> 1
node00:86079:311684 [1] NCCL INFO init.cc:210 -> 1
node00:86079:311684 [1] NCCL INFO init.cc:1986 -> 1

node00:86079:311684 [1] init.cc:2118 NCCL WARN commReclaim: cleanup comm 0xc1da940 rank 1 failed in destroy/abort, error 1
node00:86079:311684 [1] NCCL INFO comm 0xc1da940 rank 1 nranks 4 cudaDev 1 busId 38000 - Abort COMPLETE
[rank1]: Traceback (most recent call last):
[rank1]:   File "/scratch/skatar/PreTraining/pretrain_iter.py", line 678, in <module>
[rank1]:     fire.Fire()   # enables easy command line interface
[rank1]:     ^^^^^^^^^^^
[rank1]:   File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank1]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank1]:     component, remaining_args = _CallAndUpdateTrace(
[rank1]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank1]:     component = fn(*varargs, **kwargs)
[rank1]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/scratch/skatar/PreTraining/pretrain_iter.py", line 551, in do_pretrain
[rank1]:     out = model.forward(batch["ppg_segments"], apply_mask=True)
[rank1]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1639, in forward
[rank1]:     inputs, kwargs = self._pre_forward(*inputs, **kwargs)
[rank1]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1511, in _pre_forward
[rank1]:     self.logger.set_runtime_stats_and_log()
[rank1]: RuntimeError: CUDA error: unspecified launch failure
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[W1109 01:23:24.918889450 CUDAGuardImpl.h:119] Warning: CUDA warning: unspecified launch failure (function destroyEvent)
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  libtriton.so       0x00001530fd461388
1  libtriton.so       0x00001530f999db40
2  libtriton.so       0x00001530fd45eeac
3  libtriton.so       0x00001530fd461a3d
4  libc.so.6          0x00001531e9d546f0
5  libc.so.6          0x00001531e9da194c
6  libc.so.6          0x00001531e9d54646 raise + 22
7  libc.so.6          0x00001531e9d3e7f3 abort + 211
8  libstdc++.so.6     0x000015318f9f235a
9  libstdc++.so.6     0x000015318f9f13b9
10 libstdc++.so.6     0x000015318f9f1ae7 __gxx_personality_v0 + 135
11 libgcc_s.so.1      0x000015318f937dee
12 libgcc_s.so.1      0x000015318f9383a4 _Unwind_Resume + 101
13 libc10_cuda.so     0x000015318fea9269
14 libc10_cuda.so     0x000015318feb5c5f
15 libtorch_python.so 0x00001531e106ddd0
16 libc10.so          0x000015318fde169f
17 libc10.so          0x000015318fdda37b c10::TensorImpl::~TensorImpl() + 539
18 libc10.so          0x000015318fdda529 c10::TensorImpl::~TensorImpl() + 9
19 libtorch_cpu.so    0x00001531d8d0c3b4 c10d::Reducer::~Reducer() + 1476
20 libtorch_python.so 0x00001531e181eb02 std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 18
21 libtorch_python.so 0x00001531e0f30068 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 72
22 libtorch_python.so 0x00001531e1829b31
23 libtorch_python.so 0x00001531e0f3b743
24 libtorch_python.so 0x00001531e0f3c2c1
25 python             0x000000000050302a
26 python             0x000000000055effe
27 python             0x0000000000541235
28 python             0x000000000053fc68
29 python             0x000000000053fca4
30 python             0x000000000053fca4
31 python             0x000000000053fca4
32 python             0x000000000053fca4
33 python             0x00000000004f76bb
34 python             0x00000000004fb10b PyDict_SetItemString + 171
35 python             0x00000000005fcb64
36 python             0x00000000005eb943 Py_FinalizeEx + 323
37 python             0x00000000005f70e3 Py_RunMain + 403
38 python             0x00000000005bbf89 Py_BytesMain + 57
39 libc.so.6          0x00001531e9d3f590
40 libc.so.6          0x00001531e9d3f640 __libc_start_main + 128
41 python             0x00000000005bbdd3
W1109 01:23:36.708000 86070 /mnt/beegfs/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 86078 closing signal SIGTERM
W1109 01:23:36.717000 86070 /mnt/beegfs/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 86081 closing signal SIGTERM
W1109 01:23:36.750000 86070 /mnt/beegfs/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 86084 closing signal SIGTERM
E1109 01:23:53.419000 86070 /mnt/beegfs/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 1 (pid: 86079) of binary: /home/skatar/anaconda3/envs/tmp6/bin/python
Traceback (most recent call last):
  File "/home/skatar/anaconda3/envs/tmp6/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    multi_gpu_launcher(args)
  File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
pretrain_iter.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-09_01:23:36
  host      : node00.cluster
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 86079)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 86079
======================================================
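
As mentioned above, here is a rough sketch of how the training loop is wired. This is a minimal, self-contained stand-in rather than the actual pretrain_iter.py code: the model, dataset, loss, batch size, and learning rate are placeholders, and only the Accelerate setup, num_workers=0, and the shape of the forward call follow what I actually run.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # launched via `accelerate launch`, c10d/NCCL underneath

model = nn.Linear(256, 256)                                 # placeholder for the real PPG model
dataset = TensorDataset(torch.randn(1024, 256))             # placeholder for the PPG segments
loader = DataLoader(dataset, batch_size=32, num_workers=0)  # num_workers=0 as noted above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for (segments,) in loader:
    out = model(segments)         # real code: model(batch["ppg_segments"], apply_mask=True)
    loss = out.pow(2).mean()      # placeholder loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()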

Do you see the same issue when running workloads on a single GPU?
Did you also check for any Xids in dmesg?

I see messages like the following after the issue first occurs:

[186913.722882] NVRM: Xid (PCI:0000:38:00): 62, pid='<unknown>', name=<unknown>, 2025c642 2025c830 2025ce2a 2025c9a2 2025ccfe 2025cb8e 00000000 00000000
[186913.777004] NVRM: Xid (PCI:0000:38:00): 45, pid=86079, name=python, Ch 00000009
[186913.864533] NVRM: Xid (PCI:0000:38:00): 45, pid=86079, name=python, Ch 0000000a
[186913.884990] NVRM: Xid (PCI:0000:38:00): 45, pid=86079, name=python, Ch 0000000b
[186913.942895] NVRM: Xid (PCI:0000:38:00): 45, pid=86079, name=python, Ch 0000000c
[186913.987938] NVRM: Xid (PCI:0000:38:00): 45, pid=86079, name=python, Ch 0000000d
[186914.045990] NVRM: Xid (PCI:0000:38:00): 45, pid=86079, name=python, Ch 0000000e
[186914.089110] NVRM: Xid (PCI:0000:38:00): 45, pid=86079, name=python, Ch 0000000f
[186914.107930] NVRM: Xid (PCI:0000:38:00): 45, pid=86079, name=python, Ch 00000010

Also, after the initial error, only one GPU still works; the others give the following error:

In [1]: import torch; torch.cuda.is_available()
/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /opt/conda/conda-bld/pytorch_1729647429097/work/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
Out[1]: False
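
To be concrete, "only one GPU works" is based on a per-device check along these lines, run after the failure. This is just an illustrative sketch that pins each GPU via CUDA_VISIBLE_DEVICES in a separate subprocess; it is not part of my training code, and the GPU count of 4 simply matches nranks in the NCCL log above.

import os
import subprocess
import sys

# Tiny CUDA workload to run on one pinned GPU at a time.
PROBE = ("import torch; "
         "x = torch.randn(1024, 1024, device='cuda'); "
         "y = x @ x; torch.cuda.synchronize(); print('ok')")

for gpu in range(4):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    try:
        result = subprocess.run([sys.executable, "-c", PROBE], env=env,
                                capture_output=True, text=True, timeout=120)
        ok = result.returncode == 0 and "ok" in result.stdout
    except subprocess.TimeoutExpired:
        ok = False  # a wedged GPU may hang instead of raising an error
    print(f"GPU {gpu}: {'responds' if ok else 'FAILED'}")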

I will post an update on how long the single remaining GPU is able to run my job.

Xid errors are described here, and it seems the issue starts with Xid 62:

Internal micro-controller halt (newer drivers)

Thanks for these useful links. Xid 62 indicates a hardware error, driver error, or thermal issue. We have already tried a newer driver, so it must be a hardware issue. I don't know how to determine whether the problem is with specific GPUs or with their interconnect, because some GPUs can run the code on their own but not as a group. I also see the following in the output of nvidia-smi -q:

GPU Reset Status
    Reset Required                    : Yes
    Drain and Reset Recommended       : No
Clocks Event Reasons                  : System is not in ready state
Clocks
    Graphics                          : System is not in ready state
    SM                                : System is not in ready state
    Memory                            : System is not in ready state
    Video                             : System is not in ready state