Timed out initializing process group in store based barrier

I am trying to train a model with PyTorch 1.8.1, but I run into a problem that interrupts my training every 30 minutes (I have to restart the training from the checkpoint every 30 minutes):

...

    dist.init_process_group(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 525, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 212, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 6, for key: store_based_barrier_key:1 (world_size=1, worker_count=8, timeout=0:30:00)

(The tracebacks of several processes are interleaved in my log; ranks 5 and 7 raise the identical error:)

RuntimeError: Timed out initializing process group in store based barrier on rank: 7, for key: store_based_barrier_key:1 (world_size=1, worker_count=8, timeout=0:30:00)
RuntimeError: Timed out initializing process group in store based barrier on rank: 5, for key: store_based_barrier_key:1 (world_size=1, worker_count=8, timeout=0:30:00)

This seems to be related to the 'timeout' parameter, which defaults to 'default_pg_timeout' (1800 seconds) in the 'init_process_group' function, as suggested by the source code of 'torch.distributed.distributed_c10d'.

So I modified my code as:

from datetime import timedelta
import torch.distributed as dist

# default_pg_timeout is timedelta(seconds=1800)
timeout = timedelta(seconds=86400)
...

dist.init_process_group(
    backend=backend,                               # e.g. 'nccl'
    init_method='tcp://127.0.0.1:%d' % tcp_port,
    rank=local_rank,
    world_size=num_gpus,
    timeout=timeout,
)
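As an aside, the (world_size=1, worker_count=8) in the traceback suggests each process may be creating its own group with world_size=1 while 8 workers join the same store barrier. A minimal sketch of how rank and world_size are usually derived so they are consistent across processes (the helper name and arguments here are illustrative, not from my script):

```python
def compute_dist_info(node_rank, local_rank, nnodes, gpus_per_node):
    """Derive the global rank and total world size for one process.

    world_size must equal the TOTAL number of processes across all
    nodes; if each process passes world_size=1 while several workers
    join the store barrier, init_process_group times out with
    (world_size=1, worker_count=N) as in the traceback above.
    """
    world_size = nnodes * gpus_per_node
    rank = node_rank * gpus_per_node + local_rank
    return rank, world_size

# Single node with 8 GPUs, process with local_rank 6:
rank, world_size = compute_dist_info(node_rank=0, local_rank=6,
                                     nnodes=1, gpus_per_node=8)
# rank == 6, world_size == 8 (not 1)
```

These values would then be passed as the `rank` and `world_size` arguments of `dist.init_process_group`, so that every process agrees on the same world size.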

But it didn’t work (maybe?). Could someone help me, please? :sob:


Update: rebooting my instance made it work. Wondering if there’s a better solution?

@EDENP, has this issue persisted for you? I’m also using PyTorch 1.8.1 and experiencing this issue when submitting a distributed training job with 2 nodes, each having 4 GPUs. I eventually get the message: Timed out initializing process group in store based barrier on rank: 4, for key: store_based_barrier_key:1 (world_size=8, worker_count=2, timeout=0:30:00). Thanks for any help.


I’m also experiencing this exact same error when trying to run on a single node with 8 GPUs. I’ve tried rebooting my machine and setting export NCCL_IB_DISABLE=1, but neither of those has worked.

I am also experiencing this when trying to run on 2 nodes with 2 GPUs per node (PyTorch 1.9 + cuDNN, CUDA 11.1). I have set export NCCL_IB_DISABLE=1 and NCCL_SOCKET_IFNAME=eth0, but those didn’t work. By the way, they work well on a single node with 2 GPUs.
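For multi-node runs, this is a sketch of the rendezvous-related variables I’d double-check on every node before launching (the values are placeholders for your cluster, not known-good settings):

```shell
# Must be identical on every node:
export MASTER_ADDR=10.0.0.1      # an IP of the rank-0 node reachable from all nodes
export MASTER_PORT=29500         # a free port on the rank-0 node
# Per-node network settings:
export NCCL_SOCKET_IFNAME=eth0   # the NIC that actually routes between the nodes
export NCCL_DEBUG=INFO           # print NCCL setup/rendezvous logs to stderr
```

If the interface named in NCCL_SOCKET_IFNAME does not route between the nodes, the workers on one node never reach the store on the other, and the barrier times out exactly like this.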

I am having exactly the same issue when running distributed training with 2 nodes. Have you solved this problem?

Same problem, any updates?

Hello, have you found a suitable solution to this problem.

Any updates? Same problem here.

Same problem here. Would appreciate any updates!

I’m getting this exact same problem. Are there any updates for this?

I’m receiving the same error message.

Same issue here using 2 nodes and 3 GPUs per node.

Is there any solution to this? I am facing the same error with torch 1.11.0.

I am facing the same problem. Did anyone manage to solve it?

Same problem! Anyone from the developers? @ptrblck

I would recommend checking these docs and trying to debug the application with the env variables, etc.
Unfortunately, this topic does not contain enough information beyond users claiming they are running into the same issue, which is not actionable.
Let me know once you have more logging information that you could post here.
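To get that logging, something along these lines should surface more detail before the barrier times out (a sketch; `train.py` and the process count stand in for your own script and setup):

```shell
# More verbose logging from the distributed machinery (PyTorch >= 1.9)
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_CPP_LOG_LEVEL=INFO
# NCCL rendezvous / transport logs
export NCCL_DEBUG=INFO

python -m torch.distributed.run --nproc_per_node=8 train.py
```

The resulting logs usually show which rank never reached the barrier and on which interface the rendezvous was attempted, which is the kind of information that makes this debuggable.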