I trained my model on two nodes, and it hangs during initialization.
Then I added NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=ALL to see what’s going on, but there is no output file.
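For reference, I set the variables on the launch command roughly like this (a sketch only; the multi-node arguments are omitted here):

```
# sketch of how I pass the debug variables; the actual run also includes the
# multi-node launch arguments
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL \
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
    --nproc_per_node=2 alphachem_main.py
```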
I trained my model on two nodes, and it hangs during initialization.
Was there any error message? Does it behave differently if you replace gpu10
with its IP address?
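Something along these lines, assuming gpu10 is the master node (the address below is just a placeholder):

```
# what does the second node resolve gpu10 to?
getent hosts gpu10
# what addresses does gpu10 itself have? (run this on gpu10)
ip addr show
# then pass that address explicitly, e.g. --master_addr=10.1.1.10
# instead of --master_addr=gpu10
```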
Then I added NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=ALL to see what’s going on, but there is no output file.
It’s possible that init_process_group failed at the rendezvous stage (process IP/port discovery using the master as the leader), so it has not reached the NCCL code yet.
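One way to confirm it is stuck at rendezvous: while the job is hanging, check from the second node that the master’s rendezvous port is reachable at all (address and port below are placeholders, 29500 being the launcher’s default):

```
# run on the worker node while the job on the master is hanging in
# init_process_group; rank 0 should already be listening on MASTER_PORT
nc -vz 10.1.1.10 29500
# connection succeeded -> the TCP path is fine, look elsewhere
# timeout / refused    -> firewall, routing, or wrong address/port
```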
No difference, and it just gives me a connect() timeout. But when I train it on a single node with 2 GPUs using CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 alphachem_main.py, it works well.
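For completeness, the two-node launch looks roughly like this on each node (trimmed to the relevant arguments; the master address and port stand in for my actual values):

```
# node 0 (gpu10, acting as master)
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
    --nproc_per_node=2 --nnodes=2 --node_rank=0 \
    --master_addr=gpu10 --master_port=29500 alphachem_main.py

# node 1
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
    --nproc_per_node=2 --nnodes=2 --node_rank=1 \
    --master_addr=gpu10 --master_port=29500 alphachem_main.py
```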
This probably means the two machines cannot talk to each other using the given configuration. Have you tried setting NCCL_SOCKET_IFNAME
to point to the correct NIC?
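For example, assuming the interface the two machines reach each other on is eth0 (check the actual name with ip addr on both nodes):

```
# list the network interfaces and pick the one that carries inter-node traffic
ip addr show

# tell NCCL to use that interface explicitly (eth0 is a placeholder)
export NCCL_SOCKET_IFNAME=eth0
```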