Error with DDP on multiple nodes

Hello there,

I am running a test script on multiple nodes, and each node has 4 V100 GPUs. The output shows the model was trained through the last epoch, but errors occurred both before and after the actual training code, so I am not sure whether the training is actually OK.

The detailed output is below (sorry, parts of it were deleted because it is too long to post):

MASTER_ADDR:MASTER_PORT=10.148.203.143:14019
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : test_ddp.py
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 4
  run_id           : none
  rdzv_backend     : c10d
  rdzv_endpoint    : 10.148.203.143:14019
  rdzv_configs     : {'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

...........................


**[W socket.cpp:401] [c10d] The server socket has failed to listen on [::]:14019 (errno: 98 - Address already in use).**
**[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:14019 (errno: 98 - Address already in use).**
**[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:14019 (errno: 98 - Address already in use).**
**[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:14019 (errno: 98 - Address already in use).**
**[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.**
**[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:14019 (errno: 98 - Address already in use).**
**[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.**
**[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:14019 (errno: 98 - Address already in use).**
**[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.**
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /dev/shm/qwang33_5334019/torchelastic_1z0_h45a/none_kodc9foc
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /dev/shm/qwang33_5334019/torchelastic__ml4lpev/none_0tllp_t2
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /dev/shm/qwang33_5334019/torchelastic_l07k4xuu/none_in7jgqal
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /dev/shm/qwang33_5334019/torchelastic_je7yozg1/none_1adknnrf
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /dev/shm/qwang33_5334019/torchelastic_28s7v8py/none_lzfvxhvt
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /dev/shm/qwang33_5334019/torchelastic_3r_uo2v8/none_tm2sooza
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /dev/shm/qwang33_5334019/torchelastic__nxolux5/none_ygsb7b23
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /dev/shm/qwang33_5334019/torchelastic_gr2a24ff/none_1bl0esrq
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=r2i7n6.ib0.icexa.epcc.ed.ac.uk
  master_port=55839
  group_rank=0
  group_world_size=2
  local_ranks=[0, 1, 2, 3]
  role_ranks=[0, 1, 2, 3]
  global_ranks=[0, 1, 2, 3]
  role_world_sizes=[8, 8, 8, 8]
  global_world_sizes=[8, 8, 8, 8]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /dev/shm/qwang33_5334019/torchelastic_1z0_h45a/none_kodc9foc/attempt_0/0/error.json
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=r2i7n6.ib0.icexa.epcc.ed.ac.uk
  master_port=55839
  group_rank=1
  group_world_size=2
  local_ranks=[0, 1, 2, 3]
  role_ranks=[4, 5, 6, 7]
  global_ranks=[4, 5, 6, 7]
  role_world_sizes=[8, 8, 8, 8]
  global_world_sizes=[8, 8, 8, 8]

INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /dev/shm/qwang33_5334019/torchelastic_1z0_h45a/none_kodc9foc/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /dev/shm/qwang33_5334019/torchelastic_1z0_h45a/none_kodc9foc/attempt_0/2/error.json
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /dev/shm/qwang33_5334019/torchelastic_1z0_h45a/none_kodc9foc/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /dev/shm/qwang33_5334019/torchelastic_28s7v8py/none_lzfvxhvt/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /dev/shm/qwang33_5334019/torchelastic_28s7v8py/none_lzfvxhvt/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /dev/shm/qwang33_5334019/torchelastic_28s7v8py/none_lzfvxhvt/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /dev/shm/qwang33_5334019/torchelastic_28s7v8py/none_lzfvxhvt/attempt_0/3/error.json
[init] == local rank: 3, global rank: 3, on r2i7n6  ==
[init] == local rank: 2, global rank: 2, on r2i7n6  ==
[init] == local rank: 0, global rank: 0, on r2i7n6  ==
[init] == local rank: 1, global rank: 1, on r2i7n6  ==
[init] == local rank: 1, global rank: 5, on r2i7n7  ==
[init] == local rank: 3, global rank: 7, on r2i7n7  ==
[init] == local rank: 0, global rank: 4, on r2i7n7  ==
[init] == local rank: 2, global rank: 6, on r2i7n7  ==

Loading snapshot and resuming from snapshot .......

r2i7n6:3830128:3830128 [0] NCCL INFO Bootstrap : Using ib0:10.148.203.143<0>
r2i7n6:3830128:3830128 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r2i7n6:3830128:3830128 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.148.203.143<0>
r2i7n6:3830128:3830128 [0] NCCL INFO Using network IB
r2i7n7:3078438:3078438 [2] NCCL INFO Bootstrap : Using ib0:10.148.203.144<0>
r2i7n7:3078436:3078436 [0] NCCL INFO Bootstrap : Using ib0:10.148.203.144<0>
r2i7n7:3078439:3078439 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r2i7n7:3078436:3078436 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r2i7n7:3078437:3078437 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r2i7n7:3078438:3078438 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
NCCL version 2.10.3+cuda11.6
r2i7n6:3830129:3830129 [1] NCCL INFO Bootstrap : Using ib0:10.148.203.143<0>
r2i7n6:3830131:3830131 [3] NCCL INFO Bootstrap : Using ib0:10.148.203.143<0>
r2i7n6:3830130:3830130 [2] NCCL INFO Bootstrap : Using ib0:10.148.203.143<0>
r2i7n6:3830129:3830129 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r2i7n6:3830131:3830131 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r2i7n6:3830130:3830130 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
r2i7n6:3830131:3830131 [3] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.148.203.143<0>
r2i7n6:3830131:3830131 [3] NCCL INFO Using network IB
r2i7n6:3830129:3830129 [1] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.148.203.143<0>
r2i7n6:3830129:3830129 [1] NCCL INFO Using network IB
r2i7n6:3830130:3830130 [2] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.148.203.143<0>
r2i7n6:3830130:3830130 [2] NCCL INFO Using network IB
[2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.148.203.144<0>
r2i7n7:3078437:3078437 [1] NCCL INFO Using network IB
r2i7n7:3078436:3078436 [0] NCCL INFO Using network IB
r2i7n7:3078439:3078439 [3] NCCL INFO Using network IB
r2i7n6:3830130:3830230 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/6/-1->2->-1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->6
r2i7n6:3830131:3830228 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] 0/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] 0/-1/-1->3->2
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 00/04 :    0   3   2   1   4   7   6   5
r2i7n6:3830130:3830230 [2] NCCL INFO Setting affinity for GPU 2 to ffff,f00000ff,fff00000
r2i7n6:3830129:3830229 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] -1/-1/-1->1->0
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 01/04 :    0   3   6   5   4   7   2   1
r2i7n6:3830131:3830228 [3] NCCL INFO Setting affinity for GPU 3 to ffff,f00000ff,fff00000
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 02/04 :    0   3   2   1   4   7   6   5
r2i7n6:3830129:3830229 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 03/04 :    0   3   6   5   4   7   2   1
r2i7n6:3830128:3830215 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] 1/-1/-1->0->3 [2] 1/-1/-1->0->4 [3] 1/-1/-1->0->3
r2i7n6:3830128:3830215 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
r2i7n7:3078437:3078535 [1] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] -1/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] -1/-1/-1->5->4
r2i7n7:3078438:3078534 [2] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->2 [2] 7/-1/-1->6->5 [3] 7/2/-1->6->-1
r2i7n7:3078436:3078537 [0] NCCL INFO Trees [0] 5/-1/-1->4->0 [1] 5/-1/-1->4->7 [2] 5/0/-1->4->-1 [3] 5/-1/-1->4->7
r2i7n7:3078439:3078536 [3] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] 4/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] 4/-1/-1->7->6
r2i7n7:3078437:3078535 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
r2i7n7:3078436:3078537 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
r2i7n7:3078438:3078534 [2] NCCL INFO Setting affinity for GPU 2 to ffff,f00000ff,fff00000
r2i7n7:3078439:3078536 [3] NCCL INFO Setting affinity for GPU 3 to ffff,f00000ff,fff00000
r2i7n6:3830130:3830230 [2] NCCL INFO Channel 01 : 7[8a000] -> 2[88000] [receive] via NET/IB/1
r2i7n7:3078436:3078537 [0] NCCL INFO Channel 00 : 1[1c000] -> 4[1a000] [receive] via NET/IB/0
r2i7n7:3078437:3078535 [1] NCCL INFO Channel 00 : 5[1c000] -> 0[1a000] [send] via NET/IB/0
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 00 : 5[1c000] -> 0[1a000] [receive] via NET/IB/0
r2i7n6:3830129:3830229 [1] NCCL INFO Channel 00 : 1[1c000] -> 4[1a000] [send] via NET/IB/0
...........
r2i7n7:3078436:3078537 [0] NCCL INFO Channel 02 : 1[1c000] -> 4[1a000] [receive] via NET/IB/0
r2i7n7:3078437:3078535 [1] NCCL INFO Channel 02 : 5[1c000] -> 0[1a000] [send] via NET/IB/0
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 02 : 5[1c000] -> 0[1a000] [receive] via NET/IB/0
r2i7n6:3830129:3830229 [1] NCCL INFO Channel 02 : 1[1c000] -> 4[1a000] [send] via NET/IB/0
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 00 : 0[1a000] -> 3[8a000] via P2P/IPC
.....
r2i7n7:3078438:3078534 [2] NCCL INFO Channel 01 : 6[88000] -> 5[1c000] via P2P/IPC
r2i7n7:3078438:3078534 [2] NCCL INFO Channel 02 : 6[88000] -> 5[1c000] via P2P/IPC
.............
r2i7n7:3078439:3078536 [3] NCCL INFO Channel 02 : 7[8a000] -> 6[88000] via P2P/IPC
r2i7n7:3078439:3078536 [3] NCCL INFO Connected all rings
r2i7n6:3830130:3830230 [2] NCCL INFO Channel 00 : 2[88000] -> 1[1c000] via P2P/IPC
r2i7n6:3830130:3830230 [2] NCCL INFO Channel 01 : 2[88000] -> 1[1c000] via P2P/IPC
r2i7n6:3830130:3830230 [2] NCCL INFO Channel 02 : 2[88000] -> 1[1c000] via P2P/IPC
r2i7n7:3078438:3078534 [2] NCCL INFO Connected all rings
r2i7n6:3830130:3830230 [2] NCCL INFO Channel 03 : 2[88000] -> 1[1c000] via P2P/IPC
r2i7n6:3830131:3830228 [3] NCCL INFO Connected all rings
r2i7n7:3078438:3078534 [2] NCCL INFO Channel 00 : 6[88000] -> 7[8a000] via P2P/IPC
r2i7n7:3078438:3078534 [2] NCCL INFO Channel 01 : 6[88000] -> 7[8a000] via P2P/IPC
......
r2i7n7:3078439:3078536 [3] NCCL INFO Channel 03 : 7[8a000] -> 4[1a000] via P2P/IPC
r2i7n6:3830129:3830229 [1] NCCL INFO Channel 01 : 1[1c000] -> 0[1a000] via P2P/IPC
r2i7n6:3830129:3830229 [1] NCCL INFO Channel 03 : 1[1c000] -> 0[1a000] via P2P/IPC
r2i7n6:3830128:3830215 [0] NCCL INFO Connected all rings
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[1c000] via P2P/IPC
r2i7n7:3078436:3078537 [0] NCCL INFO Connected all rings
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[1c000] via P2P/IPC
r2i7n7:3078436:3078537 [0] NCCL INFO Channel 00 : 4[1a000] -> 5[1c000] via P2P/IPC
r2i7n6:3830130:3830230 [2] NCCL INFO Connected all rings
r2i7n6:3830129:3830229 [1] NCCL INFO Connected all rings
r2i7n7:3078436:3078537 [0] NCCL INFO Channel 01 : 4[1a000] -> 5[1c000] via P2P/IPC
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 02 : 0[1a000] -> 1[1c000] via P2P/IPC
r2i7n7:3078437:3078535 [1] NCCL INFO Connected all rings
r2i7n7:3078436:3078537 [0] NCCL INFO Channel 02 : 4[1a000] -> 5[1c000] via P2P/IPC
r2i7n6:3830128:3830215 [0] NCCL INFO Channel 03 : 0[1a000] -> 1[1c000] via P2P/IPC
r2i7n7:3078436:3078537 [0] NCCL INFO Channel 03 : 4[1a000] -> 5[1c000] via P2P/IPC
r2i7n6:3830130:3830230 [2] NCCL INFO Channel 00 : 2[88000] -> 3[8a000] via P2P/IPC
r2i7n7:3078437:3078535 [1] NCCL INFO Channel 00 : 5[1c000] -> 6[88000] via P2P/IPC
......
r2i7n7:3078436:3078537 [0] NCCL INFO Channel 02 : 0[1a000] -> 4[1a000] [receive] via NET/IB/0
r2i7n7:3078438:3078534 [2] NCCL INFO Channel 03 : 2[88000] -> 6[88000] [receive] via NET/IB/1
........
r2i7n7:3078436:3078537 [0] NCCL INFO Channel 00 : 4[1a000] -> 0[1a000] [send] via NET/IB/0
r2i7n7:3078438:3078534 [2] NCCL INFO Channel 01 : 6[88000] -> 2[88000] [send] via NET/IB/1
......
r2i7n6:3830130:3830230 [2] NCCL INFO Channel 03 : 2[88000] -> 6[88000] [send] via NET/IB/1
r2i7n7:3078439:3078536 [3] NCCL INFO Channel 01 : 7[8a000] -> 6[88000] via P2P/IPC
r2i7n6:3830131:3830228 [3] NCCL INFO Channel 01 : 3[8a000] -> 2[88000] via P2P/IPC
r2i7n6:3830131:3830228 [3] NCCL INFO Channel 03 : 3[8a000] -> 2[88000] via P2P/IPC
r2i7n7:3078439:3078536 [3] NCCL INFO Channel 03 : 7[8a000] -> 6[88000] via P2P/IPC
r2i7n6:3830128:3830215 [0] NCCL INFO Connected all trees
r2i7n7:3078437:3078535 [1] NCCL INFO Connected all trees
r2i7n7:3078437:3078535 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
r2i7n7:3078437:3078535 [1] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
r2i7n7:3078436:3078537 [0] NCCL INFO Connected all trees
r2i7n7:3078436:3078537 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
r2i7n7:3078436:3078537 [0] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
r2i7n6:3830129:3830229 [1] NCCL INFO Connected all trees
r2i7n6:3830130:3830230 [2] NCCL INFO Connected all trees
r2i7n6:3830131:3830228 [3] NCCL INFO Connected all trees
r2i7n7:3078438:3078534 [2] NCCL INFO Connected all trees
r2i7n7:3078438:3078534 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
r2i7n7:3078438:3078534 [2] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
r2i7n7:3078439:3078536 [3] NCCL INFO Connected all trees
r2i7n7:3078439:3078536 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
r2i7n7:3078439:3078536 [3] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
r2i7n6:3830129:3830229 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
r2i7n6:3830130:3830230 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
r2i7n6:3830131:3830228 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
r2i7n6:3830130:3830230 [2] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
r2i7n6:3830129:3830229 [1] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
r2i7n6:3830131:3830228 [3] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
r2i7n6:3830128:3830215 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
r2i7n6:3830128:3830215 [0] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
r2i7n7:3078436:3078537 [0] NCCL INFO comm 0x14e60c002fb0 rank 4 nranks 8 cudaDev 0 busId 1a000 - Init COMPLETE
r2i7n7:3078439:3078536 [3] NCCL INFO comm 0x153524002fb0 rank 7 nranks 8 cudaDev 3 busId 8a000 - Init COMPLETE
r2i7n7:3078438:3078534 [2] NCCL INFO comm 0x146f7c002fb0 rank 6 nranks 8 cudaDev 2 busId 88000 - Init COMPLETE
r2i7n7:3078437:3078535 [1] NCCL INFO comm 0x14dbf4002fb0 rank 5 nranks 8 cudaDev 1 busId 1c000 - Init COMPLETE
r2i7n6:3830130:3830230 [2] NCCL INFO comm 0x14f4dc002fb0 rank 2 nranks 8 cudaDev 2 busId 88000 - Init COMPLETE
r2i7n6:3830128:3830215 [0] NCCL INFO comm 0x14e560002fb0 rank 0 nranks 8 cudaDev 0 busId 1a000 - Init COMPLETE
r2i7n6:3830129:3830229 [1] NCCL INFO comm 0x154294002fb0 rank 1 nranks 8 cudaDev 1 busId 1c000 - Init COMPLETE
r2i7n6:3830131:3830228 [3] NCCL INFO comm 0x14691c002fb0 rank 3 nranks 8 cudaDev 3 busId 8a000 - Init COMPLETE
r2i7n6:3830128:3830128 [0] NCCL INFO Launch mode Parallel
[GPU0] Epoch 100 | Batchsize: 32 | Steps: 8[GPU2] Epoch 100 | Batchsize: 32 | Steps: 8

[GPU6] Epoch 100 | Batchsize: 32 | Steps: 8
[GPU4] Epoch 100 | Batchsize: 32 | Steps: 8

Long training output ......................

[GPU7] Epoch 999 | Batchsize: 32 | Steps: 8
[GPU5] Epoch 999 | Batchsize: 32 | Steps: 8

INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.004100799560546875 seconds
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0007669925689697266 seconds
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'r2i7n7.ib0.icexa.epcc.ed.ac.uk_3078380_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'r2i7n7.ib0.icexa.epcc.ed.ac.uk_3078381_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'r2i7n7.ib0.icexa.epcc.ed.ac.uk_3078379_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
Traceback (most recent call last):
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'r2i7n6.ib0.icexa.epcc.ed.ac.uk_3830073_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
    return getattr(self._store, store_op)(*args, **kwargs)
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'r2i7n6.ib0.icexa.epcc.ed.ac.uk_3830074_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
    return getattr(self._store, store_op)(*args, **kwargs)
RuntimeError: Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return getattr(self._store, store_op)(*args, **kwargs)
RuntimeError: Broken pipe

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/bin/torchrun", line 8, in <module>
    return f(*args, **kwargs)
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    sys.exit(main())
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    sys.exit(main())
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    elastic_launch(
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    elastic_launch(
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    elastic_launch(
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    result = agent.run()

  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 678, in _initialize_workers
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 678, in _initialize_workers
    result = f(*args, **kwargs)
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 678, in _initialize_workers
    self._rendezvous(worker_group)
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-
    get_response = self._backend.get_state()
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
    has_set = self._state_holder.sync()
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 408, in sync
    base64_state: bytes = self._call_store("get", self._key)
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
    get_response = self._backend.get_state()
    raise RendezvousConnectionError(
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'r2i7n6.ib0.icexa.epcc.ed.ac.uk_3830075_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
    return getattr(self._store, store_op)(*args, **kwargs)
RuntimeError: Broken pipe

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    result = agent.run()
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 678, in _initialize_workers
    self._rendezvous(worker_group)
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 538, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1024, in next_rendezvous
    self._op_executor.run(join_op, deadline)
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 606, in run
    has_set = self._state_holder.sync()
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 408, in sync
    get_response = self._backend.get_state()
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
    base64_state: bytes = self._call_store("get", self._key)
  File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
srun: error: r2i7n7: tasks 4-6: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=5334019.0
srun: error: r2i7n6: tasks 0-2: Exited with exit code 1

The code is a very simple test script taken from https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multinode.py
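
For reference, this is a rough, minimal sketch of the pattern that tutorial script follows; it is my own simplification, not the exact file (the toy model, dataset, and argument handling are placeholders, and the real example additionally saves and resumes a training snapshot):

import os
import sys
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group

def main(total_epochs: int, batch_size: int):
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT,
    # so init_process_group() can use the default env:// initialization.
    init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    global_rank = int(os.environ["RANK"])
    torch.cuda.set_device(local_rank)
    print(f"[init] == local rank: {local_rank}, global rank: {global_rank} ==")

    # Placeholder model and data; the real script would instead build its model
    # and load a snapshot here ("Loading snapshot and resuming from snapshot").
    model = DDP(nn.Linear(20, 1).to(local_rank), device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(2048, 20), torch.randn(2048, 1))
    loader = DataLoader(dataset, batch_size=batch_size,
                        sampler=DistributedSampler(dataset))
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for epoch in range(total_epochs):
        loader.sampler.set_epoch(epoch)  # reshuffle shards across ranks each epoch
        for x, y in loader:
            x, y = x.to(local_rank), y.to(local_rank)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        print(f"[GPU{global_rank}] Epoch {epoch} | Batchsize: {batch_size} | Steps: {len(loader)}")

    destroy_process_group()

if __name__ == "__main__":
    # The real example also takes a save-interval and a --batch_size argument.
    main(int(sys.argv[1]), 32)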

I used Slurm to submit the job:

#SBATCH --job-name=f2h_SD_nnodes
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --gres=gpu:4
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --time=96:00:00

source /work/ec204/ec204/qwang33/myenv/bin/activate

export OMP_NUM_THREADS=10

master_name=$(scontrol show hostname ${SLURM_NODELIST} | head -n 1)
master_addr="$(host ${master_name})"
master_ipaddr=$(echo "$master_addr" | awk '{print $NF}')
export MASTER_ADDR=$master_ipaddr
export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
echo "MASTER_ADDR:MASTER_PORT="${MASTER_ADDR}:${MASTER_PORT}

#export NCCL_IB_DISABLE=1
#export NCCL_P2P_DISABLE=1
export LOGLEVEL=INFO
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=ib0

srun --nodes=2 --ntasks=8 --ntasks-per-node=4 --cpus-per-task=5 \
     torchrun --nnodes=2 --nproc_per_node=4 \
     --rdzv_backend=c10d --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
     test_ddp.py 1000 100 --batch_size=32

The environment I used is:

python/3.9.13
pytorch/1.12.1
CUDA: nvidia/nvhpc/22.11

In addition, I tried this with PyTorch 2.0 and got two extra lines at the very beginning of the output; the rest of the output is pretty much the same. Any tips?

master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.

I have sorted this out. I had to wrap the srun in a for loop, so that each srun allocates all the available resources of one specific node to a single torchrun.

for nid in ${nodelist}; do
    echo "nid=${nid}"
    srun --nodelist=${nid} --nodes=1 --ntasks=1 --ntasks-per-node=1 --cpus-per-task=40 --exact \
        torchrun --nnodes=$total_nodes --nproc_per_node=4 \
                 --rdzv_id=${SLURM_JOBID} --rdzv_backend=c10d --rdzv_endpoint ${MASTER_ADDR}:${MASTER_PORT} \
                 train.py --use_ema --save_content &
#                 test_ddp.py 50 50 --batch_size=32 &

    sleep 5
done

I have similar requirements and did not use srun in a for loop.
My sbatch file:

#!/bin/bash
#SBATCH --partition=
#SBATCH -N 2
#SBATCH --gres=gpu:8
#SBATCH --mem=0
#SBATCH --ntasks-per-node 1
#SBATCH -c 128

export nodes=$SLURM_JOB_NUM_NODES
export master=$(scontrol show hostnames | head -n 1)

srun torchrun --nnodes=$nodes --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$master:29400 elastic_ddp.py

I think the issue is with --ntasks-per-node=4 in your first script; I could reproduce the same error by changing #SBATCH --ntasks-per-node to 8 in my sbatch file.
ntasks-per-node controls how many parallel srun tasks run on each node. Here there should be only one srun task (one torchrun agent) per node, so it should be 1.
I see that in your latest code you changed it to --ntasks-per-node=1; that should work without the "for loop" to assign srun to each node manually.

Please try the changes below.
In the sbatch file, add #SBATCH --ntasks-per-node 1 and launch with:

srun torchrun --nnodes=$total_nodes --nproc_per_node=4 \
     --rdzv_id=${SLURM_JOBID} --rdzv_backend=c10d --rdzv_endpoint ${MASTER_ADDR}:${MASTER_PORT} \
     train.py --use_ema --save_content &