Timeout in distributed init process group

I’m trying to run

torch.distributed.init_process_group('nccl', world_size=2, rank=0, init_method='file://' + os.path.abspath('./dummy'))

inside a Docker container that is started with the command:

docker run --runtime=nvidia --network="host" --shm-size 1g -v [from]:[to] -i -t [image] /bin/bash

But I get a RuntimeError:

RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1

Can you help me please?

Hello! The timeout happens during initialization, so it seems that not all workers are joining the group. Two follow-up questions:

  1. You are using world_size=2 and rank=0; are you also initializing another worker with rank=1? You can do this by creating multiple processes in a single Docker container, or by running two Docker containers and changing the code to use a different rank in each container (see the sketch after this list).

  2. If you are using multiple Docker containers, then since you are using the file init method, you should ensure that both containers can read and write the same file. You can do this by mounting a shared volume in both containers.
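For the single-container case, a rough sketch of launching both ranks from one script with torch.multiprocessing might look like the following (this reuses your ./dummy rendezvous file and assumes one GPU per rank; adjust paths and device assignment to your setup):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

WORLD_SIZE = 2
RENDEZVOUS_FILE = os.path.abspath('./dummy')  # same path every rank passes to init_method


def worker(rank):
    # NCCL expects each rank to own a distinct GPU.
    torch.cuda.set_device(rank)

    # Every process calls init_process_group with the same world_size and file,
    # but with its own rank; the call blocks until all ranks have joined.
    dist.init_process_group(
        backend='nccl',
        init_method='file://' + RENDEZVOUS_FILE,
        world_size=WORLD_SIZE,
        rank=rank,
    )
    dist.barrier()  # confirms that both ranks actually reached the group
    print(f'rank {rank} joined the group')
    dist.destroy_process_group()


if __name__ == '__main__':
    # A rendezvous file left over from a previous run makes new ranks time out.
    if os.path.exists(RENDEZVOUS_FILE):
        os.remove(RENDEZVOUS_FILE)
    mp.spawn(worker, nprocs=WORLD_SIZE)
```

With this layout a single `python script.py` inside the container starts both workers, so no second container or manual rank bookkeeping is needed.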

Yes, I tried to do it with https://github.com/facebookresearch/denoiser/blob/master/denoiser/executor.py#L78 and https://github.com/facebookresearch/denoiser/blob/master/denoiser/distrib.py#L34

No, it’s a simple case with only one Docker container being launched.

Could you add logging and make sure that both of your processes reach the same point before torch.distributed.init_process_group produces an error?

Also, could you make sure that both processes have write access to the directory that contains the rendezvous file, and that the file does not exist before you run torch.distributed.init_process_group? (A small check for this is sketched below the link.)

https://pytorch.org/docs/stable/distributed.html#shared-file-system-initialization
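A minimal pre-flight check for both conditions could look like this (a sketch; rendezvous_file here stands for whatever path you pass to init_method):

```python
import os

rendezvous_file = os.path.abspath('./dummy')  # same path used in init_method
rendezvous_dir = os.path.dirname(rendezvous_file)

# Every process must be able to write to the directory holding the file.
assert os.access(rendezvous_dir, os.W_OK), f'no write access to {rendezvous_dir}'

# A file left over from a previous (possibly crashed) run makes new ranks
# attach to a stale store and time out, so remove it before initializing.
if os.path.exists(rendezvous_file):
    os.remove(rendezvous_file)
```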