Init_process_group times out without an error (ProcessGroupGloo)

Hi there,

I’m trying to add support for spreading my batches across multiple NVIDIA GPUs during training.
To do this I’ve followed the PyTorch documentation, which says I must first start a process group. This is done like so:

import datetime
import torch.distributed as dist

dist.init_process_group(
    backend="gloo",
    init_method="file:///C:/Users/thefi/Voice-Cloning-App-MGPU/distributed_logging",
    world_size=2,
    timeout=datetime.timedelta(0, 180),
    rank=0
)

The distributed_logging file is generated, but after 3 minutes I get the following error:

dist.init_process_group(
  File "C:\Users\thefi\MiniConda3\envs\vc\lib\site-packages\torch\distributed\distributed_c10d.py", line 439, in init_process_group
    _default_pg = _new_process_group_helper(
  File "C:\Users\thefi\MiniConda3\envs\vc\lib\site-packages\torch\distributed\distributed_c10d.py", line 517, in _new_process_group_helper
    pg = ProcessGroupGloo(
RuntimeError: Wait timeout

Unfortunately, it does not specify why the timeout occurs.
Other things to note:

  1. The training process is imported as a function and run in a Thread, not launched from the command line
  2. This is on Windows, so I can only use gloo as the backend

If you are using world_size=2, you need two processes (one with rank 0 and one with rank 1). With a single process, init_process_group blocks waiting for the missing rank to join the rendezvous and eventually times out.
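To make that concrete, here is a minimal sketch (the `worker` function and the temp-file path are illustrative, not from the original post) that starts one process per rank with torch.multiprocessing.spawn. Each rank calls init_process_group with the same init_method and world_size but its own rank; the call only returns once both ranks have joined:

```python
import os
import datetime
import tempfile

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int, init_file: str) -> None:
    # Every rank uses the SAME file:// rendezvous and world_size, but its
    # OWN rank. init_process_group blocks until all ranks have arrived,
    # which is why running only rank 0 ends in "RuntimeError: Wait timeout".
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_file}",
        world_size=world_size,
        rank=rank,
        timeout=datetime.timedelta(seconds=180),
    )
    # Sanity check: an all-reduce across both ranks should sum to world_size.
    t = torch.ones(1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    assert t.item() == world_size
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    # Use a fresh file for the rendezvous; reusing a stale file can hang.
    init_file = os.path.join(tempfile.mkdtemp(), "rendezvous")
    # spawn launches world_size processes and passes each its rank as the
    # first argument to worker.
    mp.spawn(worker, args=(world_size, init_file), nprocs=world_size, join=True)
```

Note that these must be separate *processes*, not threads: spawning two threads in one process and calling init_process_group from each will not satisfy the rendezvous the way two ranks do.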

Could you elaborate on what you mean? Do you mean I need to run instances of this code in different processes?

Do you mean I need to run instances of this code in different processes?

Yes, you can refer to these docs for concrete examples: Writing Distributed Applications with PyTorch — PyTorch Tutorials 1.8.1+cu102 documentation