Process unexpectedly hangs up in torch.distributed.init_process_group()

I trained my model on two nodes, and it hangs during initialization in torch.distributed.init_process_group().

Then I added NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=ALL to see what's going on, but there is no output at all.
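
In case it matters, they can also be set from inside the script, as long as it happens before init_process_group (sketch; the log file path is just a placeholder):

```python
import os

# NCCL debug settings; these must be in the environment before
# init_process_group / the first NCCL call. NCCL_DEBUG=INFO prints to stdout
# unless NCCL_DEBUG_FILE redirects it to a file (%h = hostname, %p = pid).
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"
# os.environ["NCCL_DEBUG_FILE"] = "/tmp/nccl.%h.%p.log"  # optional placeholder path
```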

Was there any error message? Does it behave differently if you replace gpu10 with its IP address?
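
As a quick sanity check (hostname and port below are placeholders for your MASTER_ADDR / MASTER_PORT), you could also verify from the second node that the master hostname resolves and that the rendezvous port is reachable while rank 0 is already running:

```python
import socket

# Placeholder values: replace "gpu10" / 29500 with your MASTER_ADDR / MASTER_PORT.
print(socket.gethostbyname("gpu10"))

# This only succeeds while rank 0 is up and listening on MASTER_PORT.
with socket.create_connection(("gpu10", 29500), timeout=5) as s:
    print("connected to", s.getpeername())
```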

It’s possible that init_process_group is failing at the rendezvous stage (process IP/port discovery, using the master as the leader), so it has not reached the NCCL code yet; that would also explain why NCCL_DEBUG prints nothing.
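
For context, a minimal sketch of what the env:// rendezvous boils down to (the hostname and port are placeholders; torch.distributed.launch normally sets RANK and WORLD_SIZE for you):

```python
import os
import torch.distributed as dist

# Every rank first connects over plain TCP to the store at
# MASTER_ADDR:MASTER_PORT hosted by rank 0. A hang at this point never
# reaches NCCL, which is why NCCL_DEBUG stays silent.
os.environ.setdefault("MASTER_ADDR", "gpu10")  # placeholder master host
os.environ.setdefault("MASTER_PORT", "29500")  # placeholder free port

dist.init_process_group(backend="nccl", init_method="env://")
```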

No difference, and it reminds me of a connect() timeout. But when I train it on a single node with 2 GPUs using CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 alphachem_main.py, it works well.
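
To rule out a mismatch between the two nodes, I can also print the rendezvous variables the launcher sets, right before init_process_group; both nodes should see the same MASTER_ADDR, MASTER_PORT and WORLD_SIZE (sketch):

```python
import os

# Print what torch.distributed.launch passed to this process. MASTER_ADDR,
# MASTER_PORT and WORLD_SIZE must match on both nodes; RANK differs per process.
for key in ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK"):
    print(key, "=", os.environ.get(key))
```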

This probably means the two machines cannot talk to each other using the given configuration. Have you tried setting NCCL_SOCKET_IFNAME to point to the correct NIC?
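
A hypothetical sketch of how to check the interface names and pin NCCL to the right one (names differ from machine to machine, so check both nodes; eth0 below is just a placeholder):

```python
import os
import socket

# List the network interfaces on this host as (index, name) pairs.
print(socket.if_nameindex())

# Placeholder: replace "eth0" with the interface that actually carries the
# inter-node network. Must be set before init_process_group / the first NCCL call.
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
```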

It still hangs at the initialization step after setting this environment variable to eth0.

I found what was wrong: the network interface is named eno1 instead of eth0. However, it still hangs at this step. The error is shown in the following figure.