Upgrading from 1.8 to 1.10.2 causes DDP to fail, socket timeout

divinho · March 6, 2022, 12:56pm

I’m doing multinode DDP, one process per node. Upgrading PyTorch causes problems for some reason.

The specific error is a timeout after calling barrier():

  File "/remote/idiap.svm/temp.speech01/rbraun/code/speechbrain/speechbrain/utils/distributed.py", line 104, in ddp_barrier
    torch.distributed.barrier()
  File "/idiap/temp/rbraun/programs/anaconda3/envs/speech/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: Socket Timeout

Why would an upgrade to pytorch cause this?

pritamdamania87 · March 7, 2022, 7:27pm

@divinho Any chance you could share a minimal repro for this passing on 1.8 and failing on 1.10.2? Also, I’m assuming you are using ProcessGroupGloo here?
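
For reference, a minimal repro of this kind could look roughly like the sketch below. It is not taken from speechbrain: the helper names `rendezvous_settings` and `run_barrier`, the 60-second timeout, and the default address and port are all assumptions made for illustration. It only initializes the default process group via `env://` and calls `barrier()` once, so it can be run unchanged on both PyTorch versions.

```python
import datetime
import os


def rendezvous_settings(env=None):
    """Collect the env:// rendezvous variables (hypothetical helper).

    The defaults describe a single process on localhost; they are
    illustrative assumptions, not values from the original report.
    """
    env = os.environ if env is None else env
    return {
        "MASTER_ADDR": env.get("MASTER_ADDR", "127.0.0.1"),
        "MASTER_PORT": env.get("MASTER_PORT", "29500"),
        "RANK": int(env.get("RANK", "0")),
        "WORLD_SIZE": int(env.get("WORLD_SIZE", "1")),
    }


def run_barrier(backend="gloo", timeout_s=60):
    """Initialize the default process group and hit a single barrier."""
    # Imported here so the settings helper can be checked without torch.
    import torch
    import torch.distributed as dist

    cfg = rendezvous_settings()
    # env:// reads the master address and port from the environment.
    os.environ.setdefault("MASTER_ADDR", cfg["MASTER_ADDR"])
    os.environ.setdefault("MASTER_PORT", cfg["MASTER_PORT"])

    dist.init_process_group(
        backend=backend,
        init_method="env://",
        rank=cfg["RANK"],
        world_size=cfg["WORLD_SIZE"],
        timeout=datetime.timedelta(seconds=timeout_s),
    )
    print(f"torch {torch.__version__}, rank {cfg['RANK']} of "
          f"{cfg['WORLD_SIZE']}: entering barrier")
    dist.barrier()
    print("barrier passed")
    dist.destroy_process_group()
```

To try it, save it under some name (say `repro.py`, also just a placeholder), set `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE` and a distinct `RANK` on each node, and call `run_barrier()` on every node at roughly the same time. If this passes on both versions, the framework's launch code is the more likely suspect.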

divinho · March 15, 2022, 11:35am

Installing 1.10.1 with CUDA 11.3 fixed the problem. I think the problem might have been caused by the framework I was using (speechbrain).