I’m trying to add support for spreading my batches across multiple NVIDIA GPUs during training.
To do this I’ve followed the PyTorch documentation, which says I must first initialize a process group. This is done like so:
dist.init_process_group(
    backend="gloo",
    init_method="file:///C:/Users/thefi/Voice-Cloning-App-MGPU/distributed_logging",
    world_size=2,
    timeout=datetime.timedelta(0, 180),
    rank=0,
)
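As I understand it, init_process_group blocks until all world_size processes have joined the rendezvous, so with world_size=2 a second process calling it with rank=1 presumably needs to exist. Here is a minimal sketch of what I believe a complete two-rank setup looks like (the worker function and overall structure are my own illustration, not my actual app's code):

import datetime
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank):
    # Every rank must make this call; rank 0 alone will block
    # at the rendezvous until rank 1 arrives (or the timeout fires).
    dist.init_process_group(
        backend="gloo",
        init_method="file:///C:/Users/thefi/Voice-Cloning-App-MGPU/distributed_logging",
        world_size=2,
        timeout=datetime.timedelta(0, 180),
        rank=rank,
    )
    # ... per-rank training work would go here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, nprocs=2)  # launches one process per rank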
The distributed_logging file is generated, but after 3 minutes (the configured timeout) I get the following error:
dist.init_process_group(
  File "C:\Users\thefi\MiniConda3\envs\vc\lib\site-packages\torch\distributed\distributed_c10d.py", line 439, in init_process_group
    _default_pg = _new_process_group_helper(
  File "C:\Users\thefi\MiniConda3\envs\vc\lib\site-packages\torch\distributed\distributed_c10d.py", line 517, in _new_process_group_helper
    pg = ProcessGroupGloo(
RuntimeError: Wait timeout
Unfortunately, the error message does not specify why the timeout occurs.
Other things to note:
- The training routine is imported as a function and run inside a Thread, not launched from the command line
- This is on Windows, so I can only use gloo as the backend (a quick availability check is sketched below)
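For completeness, this is how I confirm which backends my build actually exposes (these torch.distributed query functions are standard; I'm assuming the Windows build reports nccl as unavailable):

import torch.distributed as dist

print(dist.is_available())        # distributed support compiled into this build?
print(dist.is_gloo_available())   # gloo backend present (the only full option on Windows)
print(dist.is_nccl_available())   # expected to be False on Windows builds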