Timeout in distributed init_process_group

I’m trying to run

torch.distributed.init_process_group('nccl', world_size=2, rank=0, init_method='file://' + os.path.abspath('./dummy'))

inside a docker container that is started with the command:

docker run --runtime=nvidia --network="host" --shm-size 1g -v [from]:[to] -i -t [image] /bin/bash

But I get a RuntimeError:

RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1

Can you help me please?

Hello! The timeout occurs during initialization, so it seems that not all workers are joining the group. Two follow-up questions:

  1. You are using world_size=2 and you specify rank=0; are you also initializing another worker with rank=1? You can do this by creating multiple processes in one docker container, or by running two docker containers and changing the code to use a different rank in each container.

  2. If you are using multiple docker containers: since you are using the file init method, you should ensure that both containers can read and write the same file. You can do this by using a shared volume between the containers.
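To illustrate point 1, here is a minimal sketch of launching both ranks from a single script with torch.multiprocessing. It uses the "gloo" backend so it runs without GPUs; swap in "nccl" when each rank has its own GPU. The file path is illustrative.

```python
import os
import tempfile
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int, rendezvous_file: str):
    # Every rank must call init_process_group with the SAME world_size
    # and init_method, but its OWN rank.
    dist.init_process_group(
        "gloo",  # use "nccl" when a GPU is available per rank
        world_size=world_size,
        rank=rank,
        init_method="file://" + rendezvous_file,
    )
    print(f"rank {rank} joined the group")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    # The rendezvous file must not exist before the first rank starts,
    # so put it in a fresh temporary directory.
    rendezvous_file = os.path.join(tempfile.mkdtemp(), "rendezvous")
    # spawn() starts world_size processes and passes each its rank
    # as the first argument of worker().
    mp.spawn(worker, args=(world_size, rendezvous_file), nprocs=world_size)
```

With only one process calling init_process_group (as in your snippet), rank 0 waits for a second worker that never arrives, which is exactly the store-based-barrier timeout you are seeing.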

Yes, I’m trying to make it work with denoiser/executor.py and denoiser/distrib.py from facebookresearch/denoiser on GitHub.

No, it’s a simple case with only one docker launch.

Could you add logging and make sure that both of your processes reach the same state before torch.distributed.init_process_group produces an error?

Also, could you make sure that both processes have write access to the directory that contains rendezvous_file, and that the file does not exist before you run torch.distributed.init_process_group?
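Those two checks can be done up front with a small stdlib-only pre-flight sketch; the path below mirrors the one in your snippet and is only illustrative:

```python
import os

# Same path that is passed to init_method as "file://" + ...
rendezvous_file = os.path.abspath("./dummy")
rendezvous_dir = os.path.dirname(rendezvous_file)

# Both processes need write access to the containing directory ...
assert os.access(rendezvous_dir, os.W_OK), f"no write access to {rendezvous_dir}"

# ... and a file left over from a previous run can make the new group
# read stale rendezvous state and time out, so remove it first.
if os.path.exists(rendezvous_file):
    os.remove(rendezvous_file)
```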