/pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:410, unhandled system error, NCCL version 2.4.8

Hello, I submitted a 4-node task with 1 GPU on each node, but it exited with an exception.
Some of the log information is as follows:
NCCL WARN Connect to 10.38.10.112<21724> failed : Connection refused
The strange thing is that none of the 4 nodes’ IPs is 10.38.10.112<21724>. I don’t know why it tries to connect to that IP and port.
Besides, I have set NCCL_SOCKET_IFNAME to “^lo,docker”.

self.dist_backend: nccl
self.dist_init_method: file:///home/storage15/huangying/tools/espnet/egs2/voxforge/asr1/vox.init
self.dist_world_size: 4
self.dist_rank: 1
auto allocate gpu device: 0
devices ids is  0

----------------------------------------------------details-------------------------------------------------------------

tj1-asr-train-v100-13:941227:943077 [0] NCCL INFO Call to connect returned Connection refused, retrying
tj1-asr-train-v100-13:941227:943077 [0] NCCL INFO Call to connect returned Connection refused, retrying
tj1-asr-train-v100-13:941227:943077 [0] NCCL INFO Call to connect returned Connection refused, retrying

tj1-asr-train-v100-13:941227:943077 [0] include/socket.h:390 NCCL WARN Connect to 10.38.10.112<21724> failed : Connection refused
tj1-asr-train-v100-13:941227:943077 [0] NCCL INFO bootstrap.cc:100 -> 2
tj1-asr-train-v100-13:941227:943077 [0] NCCL INFO bootstrap.cc:326 -> 2
tj1-asr-train-v100-13:941227:943077 [0] NCCL INFO init.cc:695 -> 2
tj1-asr-train-v100-13:941227:943077 [0] NCCL INFO init.cc:951 -> 2
tj1-asr-train-v100-13:941227:943077 [0] NCCL INFO misc/group.cc:69 -> 2 [Async thread]
/home/storage15/huangying/tools/anaconda3/envs/py36/lib/python3.6/site-packages/librosa/util/decorators.py:9: NumbaDeprecationWarning: An import was requested from a module that has moved location.
Import of 'jit' requested from: 'numba.decorators', please update to use 'numba.core.decorators' or pin to Numba version 0.48.0. This alias will not be present in Numba version 0.50.0.
  from numba.decorators import jit as optional_jit
/home/storage15/huangying/tools/anaconda3/envs/py36/bin/python3 /home/storage15/huangying/tools/espnet/espnet2/bin/asr_train.py --use_preprocessor true --bpemodel none --token_type char --token_list data/token_list/char/tokens.txt --non_linguistic_symbols none --train_data_path_and_name_and_type dump/fbank_pitch/tr_en/feats.scp,speech,kaldi_ark --train_data_path_and_name_and_type dump/fbank_pitch/tr_en/text,text,text --valid_data_path_and_name_and_type dump/fbank_pitch/dt_en/feats.scp,speech,kaldi_ark --valid_data_path_and_name_and_type dump/fbank_pitch/dt_en/text,text,text --train_shape_file exp/asr_stats/train/speech_shape --train_shape_file exp/asr_stats/train/text_shape.char --valid_shape_file exp/asr_stats/valid/speech_shape --valid_shape_file exp/asr_stats/valid/text_shape.char --resume true --fold_length 800 --fold_length 150 --output_dir exp/asr_train_asr_transformer_fbank_pitch_char_normalize_confnorm_varsFalse --ngpu 1 --dist_init_method file:///home/storage15/huangying/tools/espnet/egs2/voxforge/asr1/vox.init --multiprocessing_distributed false --dist_launcher queue.pl --dist_world_size 4 --config conf/train_asr_transformer.yaml --input_size=83 --normalize=global_mvn --normalize_conf stats_file=exp/asr_stats/train/feats_stats.npz --normalize_conf norm_vars=False
Traceback (most recent call last):
  File "/home/storage15/huangying/tools/anaconda3/envs/py36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/storage15/huangying/tools/anaconda3/envs/py36/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/storage15/huangying/tools/espnet/espnet2/bin/asr_train.py", line 23, in <module>
    main()
  File "/home/storage15/huangying/tools/espnet/espnet2/bin/asr_train.py", line 19, in main
    ASRTask.main(cmd=cmd)
  File "/home/storage15/huangying/tools/espnet/espnet2/tasks/abs_task.py", line 842, in main
    cls.main_worker(args)
  File "/home/storage15/huangying/tools/espnet/espnet2/tasks/abs_task.py", line 1174, in main_worker
    distributed_option=distributed_option,
  File "/home/storage15/huangying/tools/espnet/espnet2/train/trainer.py", line 163, in run
    else None
  File "/home/storage15/huangying/tools/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 303, in __init__
    self.broadcast_bucket_size)
  File "/home/storage15/huangying/tools/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 485, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:410, unhandled system error, NCCL version 2.4.8
Setting NCCL_SOCKET_IFNAME
Finished NCCL_SOCKET_IFNAME
^lo,docker
self.dist_backend: nccl
self.dist_init_method: file:///home/storage15/huangying/tools/espnet/egs2/voxforge/asr1/vox.init
self.dist_world_size: 4
self.dist_rank: 1
auto allocate gpu device: 0
devices ids is  0
# Accounting: time=107 threads=1
# Finished at Mon May 18 14:45:28 CST 2020 with status 1

Besides, if I use only 2 nodes, each with 1 GPU, it works well, with the following log. But with 3 or 4 nodes, the above error occurs.

tj1-asr-train-v100-11:41449:41449 [6] NCCL INFO Bootstrap : Using [0]eth0:10.38.10.4<0>
tj1-asr-train-v100-11:41449:41449 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
tj1-asr-train-v100-11:41449:41449 [6] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
tj1-asr-train-v100-11:41449:41449 [6] NCCL INFO NET/Socket : Using [0]eth0:10.38.10.4<0>
NCCL version 2.4.8+cuda10.0
tj1-asr-train-v100-11:41449:43711 [6] NCCL INFO Setting affinity for GPU 6 to 03ff,f0003fff
tj1-asr-train-v100-11:41449:43711 [6] NCCL INFO CUDA Dev 6[6], Socket NIC distance :  PHB
tj1-asr-train-v100-11:41449:43711 [6] NCCL INFO Channel 00 :    0   1
tj1-asr-train-v100-11:41449:43711 [6] NCCL INFO Ring 00 : 1 -> 0 [receive] via NET/Socket/0
tj1-asr-train-v100-11:41449:43711 [6] NCCL INFO NET/Socket: Using 1 threads and 1 sockets per thread
tj1-asr-train-v100-11:41449:43711 [6] NCCL INFO Ring 00 : 0 -> 1 [send] via NET/Socket/0
tj1-asr-train-v100-11:41449:43711 [6] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
tj1-asr-train-v100-11:41449:43711 [6] NCCL INFO comm 0x2b5210001cf0 rank 0 nranks 2 cudaDev 6 nvmlDev 6 - Init COMPLETE
tj1-asr-train-v100-11:41449:41449 [6] NCCL INFO Launch mode Parallel
[tj1-asr-train-v100-11:0/2] 2020-05-18 15:18:44,949 (trainer:201) INFO: 1/200epoch started
Setting NCCL_SOCKET_IFNAME
Finished NCCL_SOCKET_IFNAME
^lo,docker
self.dist_backend: nccl
self.dist_init_method: file:///home/storage15/huangying/tools/espnet/egs2/voxforge/asr1/vox.init
self.dist_world_size: 2
self.dist_rank: 0
auto allocate gpu device: 6
devices ids is  6

This reminds me of a previous discussion here. Can you check in the program, immediately before init_process_group, whether the NCCL_SOCKET_IFNAME env var contains the correct value?

Below is copied from https://github.com/pytorch/pytorch/issues/38702, as we closed that issue and moved the discussion here.

The strange thing is that none of the 4 nodes’ IPs is 10.38.10.112<21724>. I don’t know why it tries to connect to that IP and port.

Could you please check immediately before init_process_group in the code to confirm that MASTER_ADDR, MASTER_PORT, and NCCL_SOCKET_IFNAME are configured properly? Sometimes these can differ from what you set on the command line, especially when you are using a notebook. You can do so by calling os.getenv().
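For example, a minimal check could look like this (a sketch; these are the standard torch.distributed variable names, printed right before the init_process_group call):

```python
import os

# Print the rendezvous-related environment variables exactly as this
# process sees them, immediately before torch.distributed.init_process_group.
for var in ("MASTER_ADDR", "MASTER_PORT", "NCCL_SOCKET_IFNAME"):
    print(var, ":", os.getenv(var))
```

Note that os.getenv() returns None for variables that are not set, which is itself useful information here.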

Sharing the code will also be helpful. If the code is confidential, we can start with how you invoke init_process_group.

Thank you very much for your reply.
I checked by printing NCCL_SOCKET_IFNAME in my Python script like this:

        os.environ.setdefault("NCCL_DEBUG", "INFO")
        os.environ.setdefault("NCCL_IB_DISABLE", "1")
        print("Setting NCCL_SOCKET_IFNAME")
        os.environ.setdefault("NCCL_SOCKET_IFNAME", "^lo,docker")
        print("Finished NCCL_SOCKET_IFNAME")
        print(os.environ.get("NCCL_SOCKET_IFNAME"))

        # See:
        # https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group
        os.environ.setdefault("NCCL_BLOCKING_WAIT", "1")
        print("self.dist_backend:",self.dist_backend)
        print("self.dist_init_method:", self.dist_init_method)
        print("self.dist_world_size:",self.dist_world_size)
        print("self.dist_rank:",self.dist_rank)
        torch.distributed.init_process_group(
            backend=self.dist_backend,
            init_method=self.dist_init_method,
            world_size=self.dist_world_size,
            rank=self.dist_rank,
        )

The output of one node looks like this:

Setting NCCL_SOCKET_IFNAME
Finished NCCL_SOCKET_IFNAME
^lo,docker
self.dist_backend: nccl
self.dist_init_method: file:///home/storage15/huangying/tools/espnet/egs2/voxforge/asr1/vox.init
self.dist_world_size: 3
self.dist_rank: 2
MASTER_ADDR : None
MASTER_PORT : None

Besides, I used ifconfig to check the network on one of the failed nodes:
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.38.10.174 netmask 255.255.255.0 broadcast 10.38.10.255
inet6 fe80::9a03:9bff:fe0e:ff72 prefixlen 64 scopeid 0x20
ether 98:03:9b:0e:ff:72 txqueuelen 1000 (Ethernet)
RX packets 265553414516 bytes 374221992683088 (340.3 TiB)
RX errors 0 dropped 827204 overruns 0 frame 0
TX packets 101116217165 bytes 122158006104703 (111.1 TiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eth1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether 98:03:9b:0e:ff:73 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10
loop txqueuelen 0 (Local Loopback)
RX packets 17752724 bytes 17788869931 (16.5 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 17752724 bytes 17788869931 (16.5 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Thanks for sharing more details. Three questions:

  1. Since you are using file:///home/storage15/huangying/tools/espnet/egs2/voxforge/asr1/vox.init as the init_method, I assume all ranks can access the same file to rendezvous?

  2. I haven’t tried this within Docker. What is the reason for setting it to ^lo,docker instead of just lo? And does it work if you set it to eth0 or eth1?

  3. What does it return when you run the following command?

getent hosts `hostname`
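As a related sanity check from Python (a hedged sketch: eth0 is an assumption based on the ifconfig output above, so substitute the active interface on each node), you can compare what the hostname resolves to against the address you expect, and pin NCCL_SOCKET_IFNAME to one concrete interface instead of the exclusion list:

```python
import os
import socket

# What does this node's hostname resolve to? If this prints 127.0.0.1 or a
# stale address, other ranks may be told to connect to the wrong IP, which
# would match the "Connection refused to an unknown IP" symptom above.
host = socket.gethostname()
print(host, "->", socket.gethostbyname(host))

# Pin NCCL to one concrete interface (eth0 is an assumption; use the
# interface ifconfig shows as active). Must be set before init_process_group.
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
```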

Is there any solution for running PyTorch distributed training in Docker?