DDP Error on multi node CPU training


I followed this tutorial: PyTorch Distributed Training - Lei Mao's Log Book,
and modified some of the code to accommodate CPU training, since the nodes don't have GPUs. My code uses the gloo backend, and I changed the device to CPU. However, I keep running into the following error at the end of training:

ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 0.060767173767089844 seconds
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/agent/server/api.py", line 904, in _exit_barrier
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/utils/store.py", line 67, in barrier
    synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/utils/store.py", line 53, in synchronize
    agent_data = get_all(store, key_prefix, world_size)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/utils/store.py", line 31, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Connection reset by peer

I suspect it is caused by the None device id and prefix for CPUs. Is there a way I can do DDP on multiple nodes using only CPUs?
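For reference, the CPU/gloo setup I'm describing boils down to something like this (a simplified sketch, not my full training script; the environment-variable defaults are single-process placeholders for illustration, as in a real multi-node run they would be set by torchrun or the launcher):

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # Placeholder values for a single-process demo; a real launcher
    # (torchrun) sets these per node/worker instead.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")

    # gloo is the CPU-capable backend (nccl requires GPUs).
    dist.init_process_group(backend="gloo")

    # Model stays on CPU; for CPU training DDP is constructed
    # WITHOUT device_ids / output_device (they must be left as None).
    model = nn.Linear(10, 10)
    ddp_model = DDP(model)

    x = torch.randn(4, 10)
    loss = ddp_model(x).sum()
    loss.backward()  # gradients are all-reduced over gloo

    grad = model.weight.grad
    dist.destroy_process_group()
    return grad


# In a real run, each process launched by torchrun would call main().
```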


cc @H-Huang @rvarm1, it looks like multiple people have hit similar "Connection reset by peer" issues.

The same thing happens to me with multi-node GPU training as well. It doesn't seem to happen every time, but when it does, it makes the node fail and unable to restart on its own.