Init_process_group times out without an error (ProcessGroupGloo)

Hi there,

I’m trying to add support for spreading my batches across multiple NVIDIA GPUs during training.
To do this I’ve followed the PyTorch documentation, which says I must first start a process group. This is done like so:

import datetime
import torch.distributed as dist

dist.init_process_group(
    backend="gloo",
    init_method="file:///C:/Users/thefi/Voice-Cloning-App-MGPU/distributed_logging",
    world_size=2,
    timeout=datetime.timedelta(0, 180),
    rank=0
)

The distributed_logging file is generated, but after 3 minutes I get the following error:

dist.init_process_group(
  File "C:\Users\thefi\MiniConda3\envs\vc\lib\site-packages\torch\distributed\distributed_c10d.py", line 439, in init_process_group
    _default_pg = _new_process_group_helper(
  File "C:\Users\thefi\MiniConda3\envs\vc\lib\site-packages\torch\distributed\distributed_c10d.py", line 517, in _new_process_group_helper
    pg = ProcessGroupGloo(
RuntimeError: Wait timeout

Unfortunately, it does not specify why the timeout occurs.
Other things to note:

  1. The training process is imported as a function and run in a Thread, not launched from the command line
  2. This is on Windows, so I can only use gloo as the backend

If you are using world_size=2, you need two processes (one with rank 0 and one with rank 1). With a single process, init_process_group blocks waiting for the missing rank to join the rendezvous and eventually times out.
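To make that concrete, here is a minimal sketch (the `worker` function and the temp-file path are illustrative, not from the original post) that starts one process per rank with torch.multiprocessing.spawn. Each rank calls init_process_group with the same init_method and world_size but its own rank; the call only returns once both ranks have joined:

```python
import os
import datetime
import tempfile

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int, init_file: str) -> None:
    # Every rank uses the SAME file:// rendezvous and world_size, but its
    # OWN rank. init_process_group blocks until all ranks have arrived,
    # which is why running only rank 0 ends in "RuntimeError: Wait timeout".
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_file}",
        world_size=world_size,
        rank=rank,
        timeout=datetime.timedelta(seconds=180),
    )
    # Sanity check: an all-reduce across both ranks should sum to world_size.
    t = torch.ones(1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    assert t.item() == world_size
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    # Use a fresh file for the rendezvous; reusing a stale file can hang.
    init_file = os.path.join(tempfile.mkdtemp(), "rendezvous")
    # spawn launches world_size processes and passes each its rank as the
    # first argument to worker.
    mp.spawn(worker, args=(world_size, init_file), nprocs=world_size, join=True)
```

Note that these must be separate *processes*, not threads: spawning two threads in one process and calling init_process_group from each will not satisfy the rendezvous the way two ranks do.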

Could you elaborate on what you mean? Do you mean I need to run instances of this code in different processes?

Do you mean I need to run instances of this code in different processes?

Yes, you can refer to these docs for concrete examples: Writing Distributed Applications with PyTorch — PyTorch Tutorials 1.8.1+cu102 documentation