I’m running multi-node DDP with one process per node. Since I upgraded PyTorch, the job fails, and I don’t understand why.
The specific error is a timeout raised from the call to barrier():
  File "/remote/idiap.svm/temp.speech01/rbraun/code/speechbrain/speechbrain/utils/distributed.py", line 104, in ddp_barrier
    torch.distributed.barrier()
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: Socket Timeout
Why would upgrading PyTorch cause this?
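
One check that might help narrow this down is whether each node can open a TCP connection to the rank-0 rendezvous endpoint at all. This is a minimal sketch using only the standard library. It assumes the job uses the usual env:// rendezvous with MASTER_ADDR and MASTER_PORT, which is an assumption since the launch setup is not shown above. The helper name can_reach_master is hypothetical, and 127.0.0.1 / 29500 are only fallback placeholders.

```python
import os
import socket


def can_reach_master(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout_s."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Assumed env:// rendezvous variables; the fallbacks are placeholders only.
    host = os.environ.get("MASTER_ADDR", "127.0.0.1")
    port = int(os.environ.get("MASTER_PORT", "29500"))
    print(f"{host}:{port} reachable: {can_reach_master(host, port)}")
```

This would have to run on each non-zero-rank node while the rank-0 process is up and listening. It only tests basic reachability of that one endpoint, not the connections the communication backend opens on its own, so a True result does not rule out a network problem.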