Upgrading from 1.8 to 1.10.2 causes DDP to fail, socket timeout

I’m doing multi-node DDP with one process per node. After upgrading PyTorch, training fails for some reason.

The specific error is a timeout after calling barrier():

  File "/remote/idiap.svm/temp.speech01/rbraun/code/speechbrain/speechbrain/utils/distributed.py", line 104, in ddp_barrier
    torch.distributed.barrier()
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: Socket Timeout

Why would an upgrade to PyTorch cause this?

@divinho Any chance you could share a minimal repro for this passing on 1.8 and failing on 1.10.2? Also, I’m assuming you are using ProcessGroupGloo here?
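
For reference, a minimal repro along these lines might look like the sketch below. It is not taken from speechbrain. It assumes the `env://` rendezvous with `MASTER_ADDR`, `MASTER_PORT`, `RANK` and `WORLD_SIZE` set on each node, and it assumes the Gloo backend, which has not been confirmed in this thread. The names `read_rendezvous_env` and `run_barrier` and the `--run` flag are made up for illustration.

```python
import datetime
import os
import sys


def read_rendezvous_env(env=None):
    """Collect the env:// rendezvous settings, falling back to single-process defaults."""
    env = os.environ if env is None else env
    return {
        "MASTER_ADDR": env.get("MASTER_ADDR", "127.0.0.1"),
        "MASTER_PORT": env.get("MASTER_PORT", "29500"),
        "RANK": int(env.get("RANK", "0")),
        "WORLD_SIZE": int(env.get("WORLD_SIZE", "1")),
    }


def run_barrier(backend="gloo", timeout_s=60):
    """Initialise the process group and hit a single barrier, nothing else."""
    # Imported here so the helper above can be used without torch installed.
    import torch
    import torch.distributed as dist

    cfg = read_rendezvous_env()
    # env:// reads the master address and port from the environment.
    os.environ["MASTER_ADDR"] = cfg["MASTER_ADDR"]
    os.environ["MASTER_PORT"] = cfg["MASTER_PORT"]
    print(f"torch {torch.__version__}, rank {cfg['RANK']} of {cfg['WORLD_SIZE']}", flush=True)

    dist.init_process_group(
        backend=backend,  # assumption: Gloo; switch to "nccl" if that is what you use
        init_method="env://",
        rank=cfg["RANK"],
        world_size=cfg["WORLD_SIZE"],
        timeout=datetime.timedelta(seconds=timeout_s),
    )
    dist.barrier()  # the call that times out in the traceback above
    print("barrier passed", flush=True)
    dist.destroy_process_group()


# Hypothetical "--run" flag: only start the repro when asked to.
if __name__ == "__main__" and "--run" in sys.argv:
    run_barrier()
```

Running the same script with `--run` on every node, once under 1.8 and once under 1.10.2, with `RANK` differing per node, would show whether the barrier alone times out or whether something in the training framework is involved.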

Installing 1.10.1 with CUDA 11.3 fixed the problem. I think the problem might have been caused by the framework I was using (speechbrain).
