Hello there,
I am running a test script across multiple nodes, each with four V100 GPUs. The output shows the model trained through to the last epoch, but errors appeared both before and after the actual training, so I am not sure whether the training is actually OK.
The detailed output is below (sorry, some lines were deleted because the full log is too long to post):
MASTER_ADDR:MASTER_PORT=10.148.203.143:14019
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : test_ddp.py
min_nodes : 2
max_nodes : 2
nproc_per_node : 4
run_id : none
rdzv_backend : c10d
rdzv_endpoint : 10.148.203.143:14019
rdzv_configs : {'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}
...........................
**[W socket.cpp:401] [c10d] The server socket has failed to listen on [::]:14019 (errno: 98 - Address already in use).**
**[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:14019 (errno: 98 - Address already in use).**
**[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:14019 (errno: 98 - Address already in use).**
**[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:14019 (errno: 98 - Address already in use).**
**[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.**
**[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:14019 (errno: 98 - Address already in use).**
**[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.**
**[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:14019 (errno: 98 - Address already in use).**
**[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.**
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /dev/shm/qwang33_5334019/torchelastic_1z0_h45a/none_kodc9foc
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /dev/shm/qwang33_5334019/torchelastic__ml4lpev/none_0tllp_t2
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /dev/shm/qwang33_5334019/torchelastic_l07k4xuu/none_in7jgqal
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /dev/shm/qwang33_5334019/torchelastic_je7yozg1/none_1adknnrf
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /dev/shm/qwang33_5334019/torchelastic_28s7v8py/none_lzfvxhvt
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /dev/shm/qwang33_5334019/torchelastic_3r_uo2v8/none_tm2sooza
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /dev/shm/qwang33_5334019/torchelastic__nxolux5/none_ygsb7b23
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /dev/shm/qwang33_5334019/torchelastic_gr2a24ff/none_1bl0esrq
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=r2i7n6.ib0.icexa.epcc.ed.ac.uk
master_port=55839
group_rank=0
group_world_size=2
local_ranks=[0, 1, 2, 3]
role_ranks=[0, 1, 2, 3]
global_ranks=[0, 1, 2, 3]
role_world_sizes=[8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /dev/shm/qwang33_5334019/torchelastic_1z0_h45a/none_kodc9foc/attempt_0/0/error.json
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=r2i7n6.ib0.icexa.epcc.ed.ac.uk
master_port=55839
group_rank=1
group_world_size=2
local_ranks=[0, 1, 2, 3]
role_ranks=[4, 5, 6, 7]
global_ranks=[4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8]
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /dev/shm/qwang33_5334019/torchelastic_1z0_h45a/none_kodc9foc/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /dev/shm/qwang33_5334019/torchelastic_1z0_h45a/none_kodc9foc/attempt_0/2/error.json
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /dev/shm/qwang33_5334019/torchelastic_1z0_h45a/none_kodc9foc/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /dev/shm/qwang33_5334019/torchelastic_28s7v8py/none_lzfvxhvt/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /dev/shm/qwang33_5334019/torchelastic_28s7v8py/none_lzfvxhvt/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /dev/shm/qwang33_5334019/torchelastic_28s7v8py/none_lzfvxhvt/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /dev/shm/qwang33_5334019/torchelastic_28s7v8py/none_lzfvxhvt/attempt_0/3/error.json
[init] == local rank: 3, global rank: 3, on r2i7n6 ==
[init] == local rank: 2, global rank: 2, on r2i7n6 ==
[init] == local rank: 0, global rank: 0, on r2i7n6 ==
[init] == local rank: 1, global rank: 1, on r2i7n6 ==
[init] == local rank: 1, global rank: 5, on r2i7n7 ==
[init] == local rank: 3, global rank: 7, on r2i7n7 ==
[init] == local rank: 0, global rank: 4, on r2i7n7 ==
[init] == local rank: 2, global rank: 6, on r2i7n7 ==
Loading snapshot and resuming from snapshot .......
r2i7n6:3830128:3830128 [0] NCCL INFO Bootstrap : Using ib0:10.148.203.143<0>
r2i7n6:3830128:3830128 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r2i7n6:3830128:3830128 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.148.203.143<0>
r2i7n6:3830128:3830128 [0] NCCL INFO Using network IB
r2i7n7:3078438:3078438 [2] NCCL INFO Bootstrap : Using ib0:10.148.203.144<0>
r2i7n7:3078436:3078436 [0] NCCL INFO Bootstrap : Using ib0:10.148.203.144<0>
r2i7n7:3078439:3078439 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r2i7n7:3078436:3078436 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r2i7n7:3078437:3078437 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r2i7n7:3078438:3078438 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
NCCL version 2.10.3+cuda11.6
r2i7n6:3830129:3830129 [1] NCCL INFO Bootstrap : Using ib0:10.148.203.143<0>
r2i7n6:3830131:3830131 [3] NCCL INFO Bootstrap : Using ib0:10.148.203.143<0>
r2i7n6:3830130:3830130 [2] NCCL INFO Bootstrap : Using ib0:10.148.203.143<0>
r2i7n6:3830129:3830129 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r2i7n6:3830131:3830131 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r2i7n6:3830130:3830130 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r2i7n6:3830131:3830131 [3] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.148.203.143<0>
r2i7n6:3830131:3830131 [3] NCCL INFO Using network IB
r2i7n6:3830129:3830129 [1] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.148.203.143<0>
r2i7n6:3830129:3830129 [1] NCCL INFO Using network IB
r2i7n6:3830130:3830130 [2] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.148.203.143<0>
r2i7n6:3830130:3830130 [2] NCCL INFO Using network IB
[2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.148.203.144<0>
r2i7n7:3078437:3078437 [1] NCCL INFO Using network IB
r2i7n7:3078436:3078436 [0] NCCL INFO Using network IB
r2i7n7:3078439:3078439 [3] NCCL INFO Using network IB
r2i7n6:3830130:3830230 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/6/-1->2->-1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->6
r2i7n6:3830131:3830228 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] 0/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] 0/-1/-1->3->2
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 00/04 : 0 3 2 1 4 7 6 5
r2i7n6:3830130:3830230 [2] NCCL INFO Setting affinity for GPU 2 to ffff,f00000ff,fff00000
r2i7n6:3830129:3830229 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] -1/-1/-1->1->0
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 01/04 : 0 3 6 5 4 7 2 1
r2i7n6:3830131:3830228 [3] NCCL INFO Setting affinity for GPU 3 to ffff,f00000ff,fff00000
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 02/04 : 0 3 2 1 4 7 6 5
r2i7n6:3830129:3830229 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 03/04 : 0 3 6 5 4 7 2 1
r2i7n6:3830128:3830215 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] 1/-1/-1->0->3 [2] 1/-1/-1->0->4 [3] 1/-1/-1->0->3
r2i7n6:3830128:3830215 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
r2i7n7:3078437:3078535 [1] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] -1/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] -1/-1/-1->5->4
r2i7n7:3078438:3078534 [2] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->2 [2] 7/-1/-1->6->5 [3] 7/2/-1->6->-1
r2i7n7:3078436:3078537 [0] NCCL INFO Trees [0] 5/-1/-1->4->0 [1] 5/-1/-1->4->7 [2] 5/0/-1->4->-1 [3] 5/-1/-1->4->7
r2i7n7:3078439:3078536 [3] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] 4/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] 4/-1/-1->7->6
r2i7n7:3078437:3078535 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
r2i7n7:3078436:3078537 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
r2i7n7:3078438:3078534 [2] NCCL INFO Setting affinity for GPU 2 to ffff,f00000ff,fff00000
r2i7n7:3078439:3078536 [3] NCCL INFO Setting affinity for GPU 3 to ffff,f00000ff,fff00000
r2i7n6:3830130:3830230 [2] NCCL INFO Channel 01 : 7[8a000] -> 2[88000] [receive] via NET/IB/1
r2i7n7:3078436:3078537 [0] NCCL INFO Channel 00 : 1[1c000] -> 4[1a000] [receive] via NET/IB/0
r2i7n7:3078437:3078535 [1] NCCL INFO Channel 00 : 5[1c000] -> 0[1a000] [send] via NET/IB/0
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 00 : 5[1c000] -> 0[1a000] [receive] via NET/IB/0
r2i7n6:3830129:3830229 [1] NCCL INFO Channel 00 : 1[1c000] -> 4[1a000] [send] via NET/IB/0
...........
r2i7n7:3078436:3078537 [0] NCCL INFO Channel 02 : 1[1c000] -> 4[1a000] [receive] via NET/IB/0
r2i7n7:3078437:3078535 [1] NCCL INFO Channel 02 : 5[1c000] -> 0[1a000] [send] via NET/IB/0
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 02 : 5[1c000] -> 0[1a000] [receive] via NET/IB/0
r2i7n6:3830129:3830229 [1] NCCL INFO Channel 02 : 1[1c000] -> 4[1a000] [send] via NET/IB/0
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 00 : 0[1a000] -> 3[8a000] via P2P/IPC
.....
r2i7n7:3078438:3078534 [2] NCCL INFO Channel 01 : 6[88000] -> 5[1c000] via P2P/IPC
r2i7n7:3078438:3078534 [2] NCCL INFO Channel 02 : 6[88000] -> 5[1c000] via P2P/IPC
.............
r2i7n7:3078439:3078536 [3] NCCL INFO Channel 02 : 7[8a000] -> 6[88000] via P2P/IPC
r2i7n7:3078439:3078536 [3] NCCL INFO Connected all rings
r2i7n6:3830130:3830230 [2] NCCL INFO Channel 00 : 2[88000] -> 1[1c000] via P2P/IPC
r2i7n6:3830130:3830230 [2] NCCL INFO Channel 01 : 2[88000] -> 1[1c000] via P2P/IPC
r2i7n6:3830130:3830230 [2] NCCL INFO Channel 02 : 2[88000] -> 1[1c000] via P2P/IPC
r2i7n7:3078438:3078534 [2] NCCL INFO Connected all rings
r2i7n6:3830130:3830230 [2] NCCL INFO Channel 03 : 2[88000] -> 1[1c000] via P2P/IPC
r2i7n6:3830131:3830228 [3] NCCL INFO Connected all rings
r2i7n7:3078438:3078534 [2] NCCL INFO Channel 00 : 6[88000] -> 7[8a000] via P2P/IPC
r2i7n7:3078438:3078534 [2] NCCL INFO Channel 01 : 6[88000] -> 7[8a000] via P2P/IPC
......
r2i7n7:3078439:3078536 [3] NCCL INFO Channel 03 : 7[8a000] -> 4[1a000] via P2P/IPC
r2i7n6:3830129:3830229 [1] NCCL INFO Channel 01 : 1[1c000] -> 0[1a000] via P2P/IPC
r2i7n6:3830129:3830229 [1] NCCL INFO Channel 03 : 1[1c000] -> 0[1a000] via P2P/IPC
r2i7n6:3830128:3830215 [0] NCCL INFO Connected all rings
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[1c000] via P2P/IPC
r2i7n7:3078436:3078537 [0] NCCL INFO Connected all rings
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[1c000] via P2P/IPC
r2i7n7:3078436:3078537 [0] NCCL INFO Channel 00 : 4[1a000] -> 5[1c000] via P2P/IPC
r2i7n6:3830130:3830230 [2] NCCL INFO Connected all rings
r2i7n6:3830129:3830229 [1] NCCL INFO Connected all rings
r2i7n7:3078436:3078537 [0] NCCL INFO Channel 01 : 4[1a000] -> 5[1c000] via P2P/IPC
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 02 : 0[1a000] -> 1[1c000] via P2P/IPC
r2i7n7:3078437:3078535 [1] NCCL INFO Connected all rings
r2i7n7:3078436:3078537 [0] NCCL INFO Channel 02 : 4[1a000] -> 5[1c000] via P2P/IPC
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 03 : 0[1a000] -> 1[1c000] via P2P/IPC
r2i7n7:3078436:3078537 [0] NCCL INFO Channel 03 : 4[1a000] -> 5[1c000] via P2P/IPC
r2i7n6:3830130:3830230 [2] NCCL INFO Channel 00 : 2[88000] -> 3[8a000] via P2P/IPC
r2i7n7:3078437:3078535 [1] NCCL INFO Channel 00 : 5[1c000] -> 6[88000] via P2P/IPC
......
r2i7n7:3078436:3078537 [0] NCCL INFO Channel 02 : 0[1a000] -> 4[1a000] [receive] via NET/IB/0
r2i7n7:3078438:3078534 [2] NCCL INFO Channel 03 : 2[88000] -> 6[88000] [receive] via NET/IB/1
........
r2i7n7:3078436:3078537 [0] NCCL INFO Channel 00 : 4[1a000] -> 0[1a000] [send] via NET/IB/0
r2i7n7:3078438:3078534 [2] NCCL INFO Channel 01 : 6[88000] -> 2[88000] [send] via NET/IB/1
......
r2i7n6:3830130:3830230 [2] NCCL INFO Channel 03 : 2[88000] -> 6[88000] [send] via NET/IB/1
r2i7n7:3078439:3078536 [3] NCCL INFO Channel 01 : 7[8a000] -> 6[88000] via P2P/IPC
r2i7n6:3830131:3830228 [3] NCCL INFO Channel 01 : 3[8a000] -> 2[88000] via P2P/IPC
r2i7n6:3830131:3830228 [3] NCCL INFO Channel 03 : 3[8a000] -> 2[88000] via P2P/IPC
r2i7n7:3078439:3078536 [3] NCCL INFO Channel 03 : 7[8a000] -> 6[88000] via P2P/IPC
r2i7n6:3830128:3830215 [0] NCCL INFO Connected all trees
r2i7n7:3078437:3078535 [1] NCCL INFO Connected all trees
r2i7n7:3078437:3078535 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
r2i7n7:3078437:3078535 [1] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
r2i7n7:3078436:3078537 [0] NCCL INFO Connected all trees
r2i7n7:3078436:3078537 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
r2i7n7:3078436:3078537 [0] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
r2i7n6:3830129:3830229 [1] NCCL INFO Connected all trees
r2i7n6:3830130:3830230 [2] NCCL INFO Connected all trees
r2i7n6:3830131:3830228 [3] NCCL INFO Connected all trees
r2i7n7:3078438:3078534 [2] NCCL INFO Connected all trees
r2i7n7:3078438:3078534 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
r2i7n7:3078438:3078534 [2] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
r2i7n7:3078439:3078536 [3] NCCL INFO Connected all trees
r2i7n7:3078439:3078536 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
r2i7n7:3078439:3078536 [3] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
r2i7n6:3830129:3830229 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
r2i7n6:3830130:3830230 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
r2i7n6:3830131:3830228 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
r2i7n6:3830130:3830230 [2] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
r2i7n6:3830129:3830229 [1] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
r2i7n6:3830131:3830228 [3] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
r2i7n6:3830128:3830215 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
r2i7n6:3830128:3830215 [0] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
r2i7n7:3078436:3078537 [0] NCCL INFO comm 0x14e60c002fb0 rank 4 nranks 8 cudaDev 0 busId 1a000 - Init COMPLETE
r2i7n7:3078439:3078536 [3] NCCL INFO comm 0x153524002fb0 rank 7 nranks 8 cudaDev 3 busId 8a000 - Init COMPLETE
r2i7n7:3078438:3078534 [2] NCCL INFO comm 0x146f7c002fb0 rank 6 nranks 8 cudaDev 2 busId 88000 - Init COMPLETE
r2i7n7:3078437:3078535 [1] NCCL INFO comm 0x14dbf4002fb0 rank 5 nranks 8 cudaDev 1 busId 1c000 - Init COMPLETE
r2i7n6:3830130:3830230 [2] NCCL INFO comm 0x14f4dc002fb0 rank 2 nranks 8 cudaDev 2 busId 88000 - Init COMPLETE
r2i7n6:3830128:3830215 [0] NCCL INFO comm 0x14e560002fb0 rank 0 nranks 8 cudaDev 0 busId 1a000 - Init COMPLETE
r2i7n6:3830129:3830229 [1] NCCL INFO comm 0x154294002fb0 rank 1 nranks 8 cudaDev 1 busId 1c000 - Init COMPLETE
r2i7n6:3830131:3830228 [3] NCCL INFO comm 0x14691c002fb0 rank 3 nranks 8 cudaDev 3 busId 8a000 - Init COMPLETE
r2i7n6:3830128:3830128 [0] NCCL INFO Launch mode Parallel
[GPU0] Epoch 100 | Batchsize: 32 | Steps: 8
[GPU2] Epoch 100 | Batchsize: 32 | Steps: 8
[GPU6] Epoch 100 | Batchsize: 32 | Steps: 8
[GPU4] Epoch 100 | Batchsize: 32 | Steps: 8
Long training output ......................
[GPU7] Epoch 999 | Batchsize: 32 | Steps: 8
[GPU5] Epoch 999 | Batchsize: 32 | Steps: 8
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.004100799560546875 seconds
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0007669925689697266 seconds
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'r2i7n7.ib0.icexa.epcc.ed.ac.uk_3078380_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'r2i7n7.ib0.icexa.epcc.ed.ac.uk_3078381_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'r2i7n7.ib0.icexa.epcc.ed.ac.uk_3078379_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
Traceback (most recent call last):
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'r2i7n6.ib0.icexa.epcc.ed.ac.uk_3830073_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'r2i7n6.ib0.icexa.epcc.ed.ac.uk_3830074_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
return getattr(self._store, store_op)(*args, **kwargs)
RuntimeError: Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return getattr(self._store, store_op)(*args, **kwargs)
RuntimeError: Broken pipe
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/bin/torchrun", line 8, in <module>
return f(*args, **kwargs)
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
sys.exit(main())
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
sys.exit(main())
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
return launch_agent(self._config, self._entrypoint, list(args))
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
elastic_launch(
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
elastic_launch(
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
elastic_launch(
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
result = agent.run()
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 678, in _initialize_workers
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 678, in _initialize_workers
result = f(*args, **kwargs)
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 678, in _initialize_workers
self._rendezvous(worker_group)
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-
get_response = self._backend.get_state()
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
has_set = self._state_holder.sync()
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 408, in sync
base64_state: bytes = self._call_store("get", self._key)
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
get_response = self._backend.get_state()
raise RendezvousConnectionError(
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'r2i7n6.ib0.icexa.epcc.ed.ac.uk_3830075_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
RuntimeError: Broken pipe
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
result = agent.run()
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 678, in _initialize_workers
self._rendezvous(worker_group)
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 538, in _rendezvous
store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1024, in next_rendezvous
self._op_executor.run(join_op, deadline)
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 606, in run
has_set = self._state_holder.sync()
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 408, in sync
get_response = self._backend.get_state()
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
base64_state: bytes = self._call_store("get", self._key)
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
srun: error: r2i7n7: tasks 4-6: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=5334019.0
srun: error: r2i7n6: tasks 0-2: Exited with exit code 1
The code is the simple DDP example from https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multinode.py
I used Slurm to submit the job:
#SBATCH --job-name=f2h_SD_nnodes
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --gres=gpu:4
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --time=96:00:00
source /work/ec204/ec204/qwang33/myenv/bin/activate
export OMP_NUM_THREADS=10
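# pick the first node in the allocation as the rendezvous master and resolve its IP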
master_name=$(scontrol show hostname ${SLURM_NODELIST} | head -n 1)
master_addr="$(host ${master_name})"
master_ipaddr=$(echo "$master_addr" | awk '{print $NF}')
export MASTER_ADDR=$master_ipaddr
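# derive a job-specific port: 10000 plus the last four digits of the Slurm job ID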
export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
echo "MASTER_ADDR:MASTER_PORT="${MASTER_ADDR}:${MASTER_PORT}
#export NCCL_IB_DISABLE=1
#export NCCL_P2P_DISABLE=1
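# verbose launcher and NCCL logging; pin NCCL bootstrap traffic to the ib0 interface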
export LOGLEVEL=INFO
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=ib0
srun --nodes=2 --ntasks=8 --ntasks-per-node=4 --cpus-per-task=5 \
torchrun --nnodes=2 --nproc_per_node=4 \
--rdzv_backend=c10d --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
test_ddp.py 1000 100 --batch_size=32
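For what it is worth, I wonder whether the "Address already in use" warnings near the top come from launching torchrun once per Slurm task (8 copies in total, each trying to bind the same rendezvous port) rather than once per node, since torchrun spawns the nproc_per_node workers by itself. A variant I am considering (untested sketch, same variables as above):

# launch one torchrun per node; it forks the 4 local workers itself,
# so only a single process per node tries to bind the rendezvous port
srun --nodes=2 --ntasks=2 --ntasks-per-node=1 --cpus-per-task=20 \
    torchrun --nnodes=2 --nproc_per_node=4 \
    --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    test_ddp.py 1000 100 --batch_size=32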
The environment I used is:
python/3.9.13
pytorch/1.12.1
CUDA: nvidia/nvhpc/22.11
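For debugging, I suppose I could also add something like this right before the srun line, to see whether anything is already listening on the chosen port (untested; assumes ss is available on the compute nodes):

# print any listener already bound to the rendezvous port on this node
ss -tln | grep ":${MASTER_PORT}" && echo "port ${MASTER_PORT} already in use"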
In addition, I tried this with PyTorch 2.0 and got two extra lines at the very beginning of the output; the rest of the output is pretty much the same:
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Any tips?