I’m trying to add support for spreading my batches across multiple NVIDIA GPUs during training.
To do this I’ve followed the PyTorch documentation, which says I must first initialize a process group. This is done like so:
dist.init_process_group(
    backend="gloo",
    init_method="file:///C:/Users/thefi/Voice-Cloning-App-MGPU/distributed_logging",
    world_size=2,
    timeout=datetime.timedelta(0, 180),
    rank=0,
)
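As I understand it, init_process_group blocks until all world_size processes have joined the rendezvous, so with world_size=2 a second process calling it with rank=1 presumably needs to exist. Here is a minimal sketch of what I believe a complete two-rank setup looks like (the worker function and overall structure are my own illustration, not my actual app's code):

import datetime
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank):
    # Every rank must make this call; rank 0 alone will block
    # at the rendezvous until rank 1 arrives (or the timeout fires).
    dist.init_process_group(
        backend="gloo",
        init_method="file:///C:/Users/thefi/Voice-Cloning-App-MGPU/distributed_logging",
        world_size=2,
        timeout=datetime.timedelta(0, 180),
        rank=rank,
    )
    # ... per-rank training work would go here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, nprocs=2)  # launches one process per rank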
The distributed_logging file is generated, but after 3 minutes (the configured timeout) I get the following error:
dist.init_process_group(
  File "C:\Users\thefi\MiniConda3\envs\vc\lib\site-packages\torch\distributed\distributed_c10d.py", line 439, in init_process_group
    _default_pg = _new_process_group_helper(
  File "C:\Users\thefi\MiniConda3\envs\vc\lib\site-packages\torch\distributed\distributed_c10d.py", line 517, in _new_process_group_helper
    pg = ProcessGroupGloo(
RuntimeError: Wait timeout
Unfortunately, the error message does not specify why the timeout occurs.
Other things to note:
- The training routine is imported as a function and run inside a Thread, not launched from the command line
- This is on Windows, so I can only use gloo as the backend (a quick availability check is sketched below)
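For completeness, this is how I confirm which backends my build actually exposes (these torch.distributed query functions are standard; I'm assuming the Windows build reports nccl as unavailable):

import torch.distributed as dist

print(dist.is_available())        # distributed support compiled into this build?
print(dist.is_gloo_available())   # gloo backend present (the only full option on Windows)
print(dist.is_nccl_available())   # expected to be False on Windows builds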