Failed: Network is unreachable / NCCL error / distributed training

When I train a network on two nodes with the distributed training tools in PyTorch, I encounter an NCCL error. I have searched online but could not find a matching solution. The log from one node follows.

miracle-103:23833:23833 [0] NCCL INFO Bootstrap : Using [0]eno2:10.29.151.103<0>
miracle-103:23833:23833 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

miracle-103:23833:23833 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
miracle-103:23833:23833 [0] NCCL INFO NET/Socket : Using [0]eno2:10.29.151.103<0>
miracle-103:23833:23833 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.1
miracle-103:23833:23887 [0] NCCL INFO Channel 00/02 :    0   1
miracle-103:23833:23887 [0] NCCL INFO Channel 01/02 :    0   1
miracle-103:23833:23887 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
miracle-103:23833:23887 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] -1/-1/-1->0->1|1->0->-1/-1/-1
miracle-103:23833:23887 [0] NCCL INFO Setting affinity for GPU 0 to ffff,0000ffff
miracle-103:23833:23887 [0] NCCL INFO Channel 00 : 1[1c000] -> 0[1b000] [receive] via NET/Socket/0
miracle-103:23833:23887 [0] NCCL INFO Channel 00 : 0[1b000] -> 1[1c000] [send] via NET/Socket/0

miracle-103:23833:23887 [0] include/socket.h:403 NCCL WARN Connect to fe80::f803:66ff:febd:d32d%92<45371> failed : Network is unreachable
miracle-103:23833:23887 [0] NCCL INFO transport/net_socket.cc:313 -> 2
miracle-103:23833:23887 [0] NCCL INFO include/net.h:21 -> 2
miracle-103:23833:23887 [0] NCCL INFO transport/net.cc:161 -> 2
miracle-103:23833:23887 [0] NCCL INFO transport.cc:68 -> 2
miracle-103:23833:23887 [0] NCCL INFO init.cc:766 -> 2
miracle-103:23833:23887 [0] NCCL INFO init.cc:840 -> 2
miracle-103:23833:23887 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
Traceback (most recent call last):
  File "tools/train.py", line 167, in <module>
    main()
  File "tools/train.py", line 135, in main
    model.init_weights()
  File "/home1/user/anaconda3/envs/mmseg/lib/python3.7/site-packages/mmcv/runner/base_module.py", line 132, in init_weights
    m.init_weights()
  File "/home1/user/Project5_T/mmsegmentation-master/mmseg/models/backbones/Mydeit.py", line 191, in init_weights
    checkpoint = _load_checkpoint(self.pretrained, logger=logger, map_location='cpu')
  File "/home1/user/anaconda3/envs/mmseg/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 451, in _load_checkpoint
    return CheckpointLoader.load_checkpoint(filename, map_location, logger)
  File "/home1/user/anaconda3/envs/mmseg/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 244, in load_checkpoint
    return checkpoint_loader(filename, map_location)
  File "/home1/user/anaconda3/envs/mmseg/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 286, in load_from_http
    torch.distributed.barrier()
  File "/home1/user/anaconda3/envs/mmseg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2420, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554827596/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
Killing subprocess 23833
Traceback (most recent call last):
  File "/home1/user/anaconda3/envs/mmseg/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home1/user/anaconda3/envs/mmseg/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home1/user/anaconda3/envs/mmseg/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home1/user/anaconda3/envs/mmseg/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home1/user/anaconda3/envs/mmseg/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home1/user/anaconda3/envs/mmseg/bin/python', '-u', 'tools/train.py', '--local_rank=0', 'configs/mysegmenter/mysegmenter_Mydeit_Linear_512x512_160k_b8_ade20k.py', '--launcher', 'pytorch']' returned non-zero exit status 1.
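
One detail that stands out in the log: the failing connect targets fe80::f803:66ff:febd:d32d%92, which is in the IPv6 link-local range (fe80::/10), while this node bootstrapped over eno2 at 10.29.151.103. The snippet below is only a diagnostic sketch for pulling that information out of the warning line. The helper names `failed_peer` and `nccl_env` are made up for illustration, and the idea of pinning the interface with NCCL_SOCKET_IFNAME is an assumption about the cause, not a confirmed fix. The interface name on the second node is not shown in the log, so "eno2" may not apply there.

```python
import ipaddress
import re

# Matches the NCCL socket warning format seen in the log above.
CONNECT_FAILED = re.compile(r"NCCL WARN Connect to ([^\s<]+)<(\d+)> failed : (.+)")


def failed_peer(line):
    """Return (address, zone, port, reason) for a failed-connect line, else None."""
    m = CONNECT_FAILED.search(line)
    if m is None:
        return None
    raw, port, reason = m.groups()
    # Split off the zone identifier ("%92") so ipaddress can parse the rest.
    addr, _, zone = raw.partition("%")
    return ipaddress.ip_address(addr), zone, int(port), reason.strip()


def nccl_env(ifname):
    """Hypothetical helper: environment overrides to try before launching.

    NCCL_SOCKET_IFNAME and NCCL_DEBUG are documented NCCL variables; the
    interface name passed in is an assumption and may differ on each node.
    """
    return {"NCCL_SOCKET_IFNAME": ifname, "NCCL_DEBUG": "INFO"}


line = (
    "miracle-103:23833:23887 [0] include/socket.h:403 NCCL WARN "
    "Connect to fe80::f803:66ff:febd:d32d%92<45371> failed : Network is unreachable"
)
addr, zone, port, reason = failed_peer(line)
print("IPv6 link-local peer:", addr.version == 6 and addr.is_link_local)
print("zone:", zone, "port:", port, "reason:", reason)
print("env to try:", nccl_env("eno2"))
```

If the peer address is link-local, one thing to try is exporting the variables from `nccl_env` on both nodes (with each node's own interface name) before running torch.distributed.launch, so that both ranks advertise an address on the routable network.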