I train my model with DDP using this command: NCCL_DEBUG=INFO /bin/python3 -m torch.distributed.launch --nproc_per_node=2 --master_port=1234 /home/toan/HAT_Model/test.py
It throws this error:
warnings.warn(
[2024-08-13 15:44:52,787] torch.distributed.run: [WARNING]
[2024-08-13 15:44:52,787] torch.distributed.run: [WARNING] *****************************************
[2024-08-13 15:44:52,787] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-08-13 15:44:52,787] torch.distributed.run: [WARNING] *****************************************
/home/toan/.local/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3526.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
/home/toan/.local/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3526.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
DESKTOP-7DHJIHI:82115:82115 [0] NCCL INFO cudaDriverVersion 12020
DESKTOP-7DHJIHI:82116:82116 [0] NCCL INFO cudaDriverVersion 12020
DESKTOP-7DHJIHI:82116:82116 [0] misc/cudawrap.cc:33 NCCL WARN Cuda failure 'initialization error'
DESKTOP-7DHJIHI:82115:82115 [0] misc/cudawrap.cc:33 NCCL WARN Cuda failure 'initialization error'
DESKTOP-7DHJIHI:82116:82116 [0] NCCL INFO Bootstrap : Using eth0:172.20.10.6<0>
DESKTOP-7DHJIHI:82115:82115 [0] NCCL INFO Bootstrap : Using eth0:172.20.10.6<0>
DESKTOP-7DHJIHI:82116:82116 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
DESKTOP-7DHJIHI:82115:82115 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
DESKTOP-7DHJIHI:82115:82115 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
DESKTOP-7DHJIHI:82116:82116 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
NCCL version 2.18.1+cuda12.1
NCCL version 2.18.1+cuda12.1
DESKTOP-7DHJIHI:82115:82199 [1] NCCL INFO Failed to open libibverbs.so[.1]
DESKTOP-7DHJIHI:82115:82199 [1] NCCL INFO NET/Socket : Using [0]eth0:172.20.10.6<0>
DESKTOP-7DHJIHI:82115:82199 [1] NCCL INFO Using network Socket
DESKTOP-7DHJIHI:82116:82201 [1] NCCL INFO Failed to open libibverbs.so[.1]
DESKTOP-7DHJIHI:82116:82201 [1] NCCL INFO NET/Socket : Using [0]eth0:172.20.10.6<0>
DESKTOP-7DHJIHI:82116:82201 [1] NCCL INFO Using network Socket
DESKTOP-7DHJIHI:82115:82198 [0] NCCL INFO Using network Socket
DESKTOP-7DHJIHI:82116:82200 [0] NCCL INFO Using network Socket
DESKTOP-7DHJIHI:82115:82198 [0] NCCL INFO Channel 00/04 : 0 1
DESKTOP-7DHJIHI:82115:82198 [0] NCCL INFO Channel 01/04 : 0 1
DESKTOP-7DHJIHI:82115:82198 [0] NCCL INFO Channel 02/04 : 0 1
DESKTOP-7DHJIHI:82115:82198 [0] NCCL INFO Channel 03/04 : 0 1
DESKTOP-7DHJIHI:82115:82198 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
DESKTOP-7DHJIHI:82115:82198 [0] NCCL INFO P2P Chunksize set to 524288
DESKTOP-7DHJIHI:82115:82199 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
DESKTOP-7DHJIHI:82115:82199 [1] NCCL INFO P2P Chunksize set to 524288
DESKTOP-7DHJIHI:82116:82200 [0] NCCL INFO Channel 00/04 : 0 1
DESKTOP-7DHJIHI:82116:82200 [0] NCCL INFO Channel 01/04 : 0 1
DESKTOP-7DHJIHI:82116:82200 [0] NCCL INFO Channel 02/04 : 0 1
DESKTOP-7DHJIHI:82116:82200 [0] NCCL INFO Channel 03/04 : 0 1
DESKTOP-7DHJIHI:82116:82201 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
DESKTOP-7DHJIHI:82116:82201 [1] NCCL INFO P2P Chunksize set to 524288
DESKTOP-7DHJIHI:82116:82200 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
DESKTOP-7DHJIHI:82116:82200 [0] NCCL INFO P2P Chunksize set to 524288
DESKTOP-7DHJIHI:82115:82199 [1] NCCL INFO Channel 00/0 : 1[81000] -> 0[2000] via P2P/direct pointer
DESKTOP-7DHJIHI:82115:82198 [0] NCCL INFO Channel 00/0 : 0[2000] -> 1[81000] via P2P/direct pointer
DESKTOP-7DHJIHI:82116:82200 [0] NCCL INFO Channel 00/0 : 0[2000] -> 1[81000] via P2P/direct pointer
DESKTOP-7DHJIHI:82116:82201 [1] NCCL INFO Channel 00/0 : 1[81000] -> 0[2000] via P2P/direct pointer
DESKTOP-7DHJIHI:82115:82199 [1] NCCL INFO Channel 01/0 : 1[81000] -> 0[2000] via P2P/direct pointer
DESKTOP-7DHJIHI:82115:82198 [0] NCCL INFO Channel 01/0 : 0[2000] -> 1[81000] via P2P/direct pointer
DESKTOP-7DHJIHI:82116:82201 [1] NCCL INFO Channel 01/0 : 1[81000] -> 0[2000] via P2P/direct pointer
DESKTOP-7DHJIHI:82116:82200 [0] NCCL INFO Channel 01/0 : 0[2000] -> 1[81000] via P2P/direct pointer
DESKTOP-7DHJIHI:82115:82199 [1] NCCL INFO Channel 02/0 : 1[81000] -> 0[2000] via P2P/direct pointer
DESKTOP-7DHJIHI:82115:82198 [0] NCCL INFO Channel 02/0 : 0[2000] -> 1[81000] via P2P/direct pointer
DESKTOP-7DHJIHI:82116:82200 [0] NCCL INFO Channel 02/0 : 0[2000] -> 1[81000] via P2P/direct pointer
DESKTOP-7DHJIHI:82116:82201 [1] NCCL INFO Channel 02/0 : 1[81000] -> 0[2000] via P2P/direct pointer
DESKTOP-7DHJIHI:82115:82199 [1] NCCL INFO Channel 03/0 : 1[81000] -> 0[2000] via P2P/direct pointer
DESKTOP-7DHJIHI:82115:82198 [0] NCCL INFO Channel 03/0 : 0[2000] -> 1[81000] via P2P/direct pointer
DESKTOP-7DHJIHI:82116:82200 [0] NCCL INFO Channel 03/0 : 0[2000] -> 1[81000] via P2P/direct pointer
DESKTOP-7DHJIHI:82116:82201 [1] NCCL INFO Channel 03/0 : 1[81000] -> 0[2000] via P2P/direct pointer
DESKTOP-7DHJIHI:82115:82198 [0] NCCL INFO Connected all rings
DESKTOP-7DHJIHI:82115:82198 [0] NCCL INFO Connected all trees
DESKTOP-7DHJIHI:82115:82198 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
DESKTOP-7DHJIHI:82115:82198 [0] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 4 p2p channels per peer
DESKTOP-7DHJIHI:82115:82199 [1] NCCL INFO Connected all rings
DESKTOP-7DHJIHI:82115:82199 [1] NCCL INFO Connected all trees
DESKTOP-7DHJIHI:82115:82199 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
DESKTOP-7DHJIHI:82115:82199 [1] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 4 p2p channels per peer
DESKTOP-7DHJIHI:82116:82200 [0] NCCL INFO Connected all rings
DESKTOP-7DHJIHI:82116:82201 [1] NCCL INFO Connected all rings
DESKTOP-7DHJIHI:82116:82200 [0] NCCL INFO Connected all trees
DESKTOP-7DHJIHI:82116:82201 [1] NCCL INFO Connected all trees
DESKTOP-7DHJIHI:82116:82201 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
DESKTOP-7DHJIHI:82116:82201 [1] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 4 p2p channels per peer
DESKTOP-7DHJIHI:82116:82200 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
DESKTOP-7DHJIHI:82116:82200 [0] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 4 p2p channels per peer
DESKTOP-7DHJIHI:82115:82198 [0] NCCL INFO comm 0x55be8d2e1ce0 rank 0 nranks 2 cudaDev 0 busId 2000 commId 0x9bc561efd9a2224a - Init COMPLETE
DESKTOP-7DHJIHI:82115:82199 [1] NCCL INFO comm 0x55be8d2e6a60 rank 1 nranks 2 cudaDev 1 busId 81000 commId 0x9bc561efd9a2224a - Init COMPLETE
DESKTOP-7DHJIHI:82115:82115 [0] misc/strongstream.cc:60 NCCL WARN Cuda failure 'unknown error'
DESKTOP-7DHJIHI:82115:82115 [0] NCCL INFO enqueue.cc:1550 -> 1
DESKTOP-7DHJIHI:82115:82115 [0] NCCL INFO enqueue.cc:1591 -> 1
DESKTOP-7DHJIHI:82115:82115 [1] NCCL INFO group.cc:106 -> 1
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
DESKTOP-7DHJIHI:82116:82201 [1] NCCL INFO comm 0x55b45e006740 rank 1 nranks 2 cudaDev 1 busId 81000 commId 0xcbe944600c283a75 - Init COMPLETE
DESKTOP-7DHJIHI:82116:82200 [0] NCCL INFO comm 0x55b45e0019c0 rank 0 nranks 2 cudaDev 0 busId 2000 commId 0xcbe944600c283a75 - Init COMPLETE
DESKTOP-7DHJIHI:82116:82116 [0] misc/strongstream.cc:60 NCCL WARN Cuda failure 'an illegal memory access was encountered'
DESKTOP-7DHJIHI:82116:82116 [0] NCCL INFO enqueue.cc:1550 -> 1
DESKTOP-7DHJIHI:82116:82116 [0] NCCL INFO enqueue.cc:1591 -> 1
DESKTOP-7DHJIHI:82116:82116 [1] NCCL INFO group.cc:106 -> 1
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
[2024-08-13 15:45:12,817] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 82115) of binary: /bin/python3
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/toan/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/home/toan/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/home/toan/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/home/toan/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/toan/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/toan/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/toan/HAT_Model/test.py FAILED
Failures:
[1]:
time : 2024-08-13_15:45:12
host : DESKTOP-7DHJIHI.
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 82116)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 82116
Root Cause (first observed failure):
[0]:
time : 2024-08-13_15:45:12
host : DESKTOP-7DHJIHI.
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 82115)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 82115
My hardware: two RTX 2080 Ti GPUs, CUDA 11.5. I have already tested NCCL standalone and it runs fine, but when I run it through my training code it throws this error.
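For reference, my standalone NCCL check was along these lines, a minimal all_reduce sanity test (this is a sketch, not my exact script; it is shown here with a single process and the gloo backend so it runs even without GPUs, while the real check used backend="nccl" with one process per GPU):

```python
import os
import torch
import torch.distributed as dist

# Single-process rendezvous so this sketch runs anywhere;
# the real check used backend="nccl" launched with one process per GPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

t = torch.ones(2)
dist.all_reduce(t)  # sums the tensor element-wise across all ranks
print(t.tolist())   # with world_size=1 the tensor is unchanged: [1.0, 1.0]
dist.destroy_process_group()
```

With two ranks and backend="nccl", each element comes back as 2.0 when the communicator is healthy, which is what I saw in my standalone test.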