DDP error on multi-node CPU training


I followed the tutorial "PyTorch Distributed Training" from Lei Mao's Log Book and modified some of the code to support CPU training, since the nodes don't have GPUs. My code uses the gloo backend, and I changed the device to CPU. However, I keep running into the following error at the end of training:

ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 0.060767173767089844 seconds
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/agent/server/api.py", line 904, in _exit_barrier
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/utils/store.py", line 67, in barrier
    synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/utils/store.py", line 53, in synchronize
    agent_data = get_all(store, key_prefix, world_size)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/utils/store.py", line 31, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Connection reset by peer

I suspect it is caused by the None device ID and the prefix used for CPUs. Is there a way I can run DDP on multiple nodes using only CPUs?
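
For reference, here is a minimal sketch of the kind of CPU-only DDP setup described above. It is not the tutorial's code: the toy model, the helper name `run_cpu_ddp_step`, and the default address and port are placeholders chosen for illustration. It assumes the gloo backend and leaves `device_ids` at its default of None, which is what DDP expects for CPU modules. It also destroys the process group before the worker exits. Whether that avoids the exit-barrier error is not something this sketch demonstrates.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run_cpu_ddp_step() -> float:
    """Run one DDP training step on CPU and return this rank's loss."""
    # The elastic launcher normally sets these variables. The defaults are
    # placeholders so the sketch also runs as a single local process.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))

    # gloo supports collectives on CPU tensors.
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
    try:
        model = nn.Linear(4, 2)  # toy model standing in for the real network
        ddp_model = DDP(model)   # device_ids stays None for CPU modules
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

        # Random data standing in for a real DataLoader batch.
        inputs = torch.randn(8, 4)
        targets = torch.randn(8, 2)

        loss = nn.functional.mse_loss(ddp_model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()  # gradients are all-reduced across ranks here
        optimizer.step()

        # Let every rank finish before any of them tears down.
        dist.barrier()
        return loss.item()
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(f"loss: {run_cpu_ddp_step():.4f}")
```

When started through the launcher on each node, `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` come from the environment, so the same script should work unchanged for the single-process and multi-node cases.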


cc @H-Huang @rvarm1: it looks like multiple people have hit similar "connection reset by peer" issues.