Distributed training has been failing for me for several months now. The code runs fine for around 1.5 days and then crashes with the error below. I am on CUDA 12.4 and PyTorch 2.5.1, using Accelerate for multi-GPU training with the c10d rendezvous backend and num_workers=0 in the DataLoader. A simplified sketch of the training setup follows, and then the full failure output.
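This is only a minimal, approximate reconstruction of the setup, not my real pretrain_iter.py: the model and dataset below are placeholders, and only the `model.forward(batch["ppg_segments"], apply_mask=True)` call and the `fire.Fire()` entry point match the actual script. It is launched with something like `accelerate launch --multi_gpu --num_processes 4 pretrain_iter.py do_pretrain` (4 GPUs on one node, matching nranks 4 in the log).

```python
import fire
import torch
import torch.nn as nn
from accelerate import Accelerator
from torch.utils.data import DataLoader, Dataset


class DummyPPGDataset(Dataset):
    """Placeholder for the real PPG pretraining dataset."""
    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        return {"ppg_segments": torch.randn(16, 128)}


class DummyEncoder(nn.Module):
    """Placeholder for the real masked PPG encoder."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(128, 128)

    def forward(self, x, apply_mask=False):
        # Real model applies masking and returns a reconstruction loss;
        # here we just return a scalar so the loop is runnable.
        return self.proj(x).mean()


def do_pretrain():
    accelerator = Accelerator()
    model = DummyEncoder()
    loader = DataLoader(DummyPPGDataset(), batch_size=8, num_workers=0)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Accelerate wraps the model in DistributedDataParallel and shards the loader.
    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

    for epoch in range(10):
        for batch in loader:
            out = model.forward(batch["ppg_segments"], apply_mask=True)  # <- line 551, where the crash occurs
            accelerator.backward(out)
            optimizer.step()
            optimizer.zero_grad()


if __name__ == "__main__":
    fire.Fire()  # command line interface, as in the real script
```

The actual run, launched the same way, aborts after roughly 1.5 days with the output below.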
An error occurred during training: CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
File "/scratch/skatar/PreTraining/pretrain_iter.py", line 551, in do_pretrain
out = model.forward(batch["ppg_segments"], apply_mask=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1639, in forward
inputs, kwargs = self._pre_forward(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1511, in _pre_forward
self.logger.set_runtime_stats_and_log()
RuntimeError: CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
node00:86079:311684 [1] misc/strongstream.cc:395 NCCL WARN Cuda failure 'unspecified launch failure'
node00:86079:311684 [1] NCCL INFO init.cc:1952 -> 1
node00:86079:311684 [1] init.cc:2083 NCCL WARN commReclaim: comm 0xc1da940 (rank = 1) in abort, error 1
node00:86079:86307 [1] NCCL INFO [Service thread] Connection closed by localRank 1
node00:86079:86307 [1] include/alloc.h:125 NCCL WARN Cuda failure 719 'unspecified launch failure'
node00:86079:86307 [1] NCCL INFO include/alloc.h:246 -> 1
[... the previous two lines (alloc.h:125 NCCL WARN / alloc.h:246 NCCL INFO) repeat 12 more times ...]
node00:86079:86307 [1] NCCL INFO transport/net.cc:541 -> 1
node00:86079:86307 [1] NCCL INFO transport/net.cc:944 -> 1
node00:86079:86307 [1] NCCL INFO proxy.cc:984 -> 1
node00:86079:86307 [1] NCCL INFO proxy.cc:1000 -> 1
node00:86079:311684 [1] include/alloc.h:125 NCCL WARN Cuda failure 719 'unspecified launch failure'
node00:86079:311684 [1] NCCL INFO include/alloc.h:246 -> 1
node00:86079:311684 [1] NCCL INFO transport/p2p.cc:541 -> 1
node00:86079:311684 [1] NCCL INFO channel.cc:158 -> 1
node00:86079:311684 [1] NCCL INFO init.cc:210 -> 1
node00:86079:311684 [1] NCCL INFO init.cc:1986 -> 1
node00:86079:311684 [1] init.cc:2118 NCCL WARN commReclaim: cleanup comm 0xc1da940 rank 1 failed in destroy/abort, error 1
node00:86079:311684 [1] NCCL INFO comm 0xc1da940 rank 1 nranks 4 cudaDev 1 busId 38000 - Abort COMPLETE
[rank1]: Traceback (most recent call last):
[rank1]: File "/scratch/skatar/PreTraining/pretrain_iter.py", line 678, in <module>
[rank1]: fire.Fire() # enables easy commonda line interface
[rank1]: ^^^^^^^^^^^
[rank1]: File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank1]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank1]: component, remaining_args = _CallAndUpdateTrace(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank1]: component = fn(*varargs, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/scratch/skatar/PreTraining/pretrain_iter.py", line 551, in do_pretrain
[rank1]: out = model.forward(batch["ppg_segments"], apply_mask=True)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1639, in forward
[rank1]: inputs, kwargs = self._pre_forward(*inputs, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1511, in _pre_forward
[rank1]: self.logger.set_runtime_stats_and_log()
[rank1]: RuntimeError: CUDA error: unspecified launch failure
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[W1109 01:23:24.918889450 CUDAGuardImpl.h:119] Warning: CUDA warning: unspecified launch failure (function destroyEvent)
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0 libtriton.so 0x00001530fd461388
1 libtriton.so 0x00001530f999db40
2 libtriton.so 0x00001530fd45eeac
3 libtriton.so 0x00001530fd461a3d
4 libc.so.6 0x00001531e9d546f0
5 libc.so.6 0x00001531e9da194c
6 libc.so.6 0x00001531e9d54646 raise + 22
7 libc.so.6 0x00001531e9d3e7f3 abort + 211
8 libstdc++.so.6 0x000015318f9f235a
9 libstdc++.so.6 0x000015318f9f13b9
10 libstdc++.so.6 0x000015318f9f1ae7 __gxx_personality_v0 + 135
11 libgcc_s.so.1 0x000015318f937dee
12 libgcc_s.so.1 0x000015318f9383a4 _Unwind_Resume + 101
13 libc10_cuda.so 0x000015318fea9269
14 libc10_cuda.so 0x000015318feb5c5f
15 libtorch_python.so 0x00001531e106ddd0
16 libc10.so 0x000015318fde169f
17 libc10.so 0x000015318fdda37b c10::TensorImpl::~TensorImpl() + 539
18 libc10.so 0x000015318fdda529 c10::TensorImpl::~TensorImpl() + 9
19 libtorch_cpu.so 0x00001531d8d0c3b4 c10d::Reducer::~Reducer() + 1476
20 libtorch_python.so 0x00001531e181eb02 std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 18
21 libtorch_python.so 0x00001531e0f30068 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 72
22 libtorch_python.so 0x00001531e1829b31
23 libtorch_python.so 0x00001531e0f3b743
24 libtorch_python.so 0x00001531e0f3c2c1
25 python 0x000000000050302a
26 python 0x000000000055effe
27 python 0x0000000000541235
28 python 0x000000000053fc68
29 python 0x000000000053fca4
30 python 0x000000000053fca4
31 python 0x000000000053fca4
32 python 0x000000000053fca4
33 python 0x00000000004f76bb
34 python 0x00000000004fb10b PyDict_SetItemString + 171
35 python 0x00000000005fcb64
36 python 0x00000000005eb943 Py_FinalizeEx + 323
37 python 0x00000000005f70e3 Py_RunMain + 403
38 python 0x00000000005bbf89 Py_BytesMain + 57
39 libc.so.6 0x00001531e9d3f590
40 libc.so.6 0x00001531e9d3f640 __libc_start_main + 128
41 python 0x00000000005bbdd3
W1109 01:23:36.708000 86070 /mnt/beegfs/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 86078 closing signal SIGTERM
W1109 01:23:36.717000 86070 /mnt/beegfs/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 86081 closing signal SIGTERM
W1109 01:23:36.750000 86070 /mnt/beegfs/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 86084 closing signal SIGTERM
E1109 01:23:53.419000 86070 /mnt/beegfs/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 1 (pid: 86079) of binary: /home/skatar/anaconda3/envs/tmp6/bin/python
Traceback (most recent call last):
File "/home/skatar/anaconda3/envs/tmp6/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/skatar/anaconda3/envs/tmp6/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
pretrain_iter.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-11-09_01:23:36
host : node00.cluster
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 86079)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 86079
======================================================