Timed out initializing process group in store based barrier

I am trying to train a model with PyTorch 1.8.1, but I run into a problem that interrupts my training every 30 minutes (I have to restart the training from the last checkpoint every 30 minutes):

...

    dist.init_process_group(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 525, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 212, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 6, for key: store_based_barrier_key:1 (world_size=1, worker_count=8, timeout=0:30:00)
RuntimeError: Timed out initializing process group in store based barrier on rank: 7, for key: store_based_barrier_key:1 (world_size=1, worker_count=8, timeout=0:30:00)
RuntimeError: Timed out initializing process group in store based barrier on rank: 5, for key: store_based_barrier_key:1 (world_size=1, worker_count=8, timeout=0:30:00)

This seems to be related to the ‘timeout’ parameter of ‘init_process_group’, which defaults to ‘default_pg_timeout’ (1800 seconds, i.e. exactly the 0:30:00 in the error message), as the source code of ‘torch.distributed.distributed_c10d’ suggests.

So I modified my code as:

from datetime import timedelta

# default_pg_timeout is timedelta(seconds=1800)
timeout = timedelta(seconds=86400)
...

dist.init_process_group(
    backend=backend,
    init_method='tcp://127.0.0.1:%d' % tcp_port,
    rank=local_rank,
    world_size=num_gpus,
    timeout=timeout,
)

But it didn’t work (maybe?). Could someone help me please? :sob:
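For context on what is actually timing out: the barrier works roughly by having every rank increment a shared counter in the TCP store and then poll until the counter equals world_size. Below is a simplified, torch-free toy model of that logic (the ToyStore class and function names here are illustrative stand-ins, not PyTorch’s actual Store API). It also shows why the mismatch in the error above (world_size=1 but worker_count=8) can only end in a timeout, no matter how large the timeout is:

```python
import threading
import time


class ToyStore:
    """Toy stand-in for the key/value store used during rendezvous."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def add(self, key, amount):
        # Atomically increment a counter, like Store.add() does.
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) + amount
            return self._counts[key]

    def get(self, key):
        with self._lock:
            return self._counts.get(key, 0)


def store_based_barrier(store, key, world_size, timeout_s):
    """Simplified sketch of _store_based_barrier: register this worker,
    then wait until the registered count equals world_size."""
    store.add(key, 1)
    deadline = time.monotonic() + timeout_s
    while store.get(key) != world_size:
        if time.monotonic() > deadline:
            raise RuntimeError(
                "Timed out in store based barrier "
                f"(world_size={world_size}, worker_count={store.get(key)})"
            )
        time.sleep(0.01)


# Matching case: 8 workers, each told world_size=8 -> barrier completes.
store = ToyStore()
threads = [
    threading.Thread(
        target=store_based_barrier, args=(store, "barrier_key:1", 8, 5.0)
    )
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("completed, worker_count =", store.get("barrier_key:1"))

# Mismatched case, like the error above: 7 other workers have already
# registered, but this worker was told world_size=1. The counter reaches
# 8, the check `count != 1` stays true forever, and we time out.
store2 = ToyStore()
store2.add("barrier_key:1", 7)
try:
    store_based_barrier(store2, "barrier_key:1", 1, timeout_s=0.2)
except RuntimeError as exc:
    print(exc)
```

So if the printed world_size in the error does not match the number of processes you actually launched, raising the timeout cannot help; the arguments passed to init_process_group are worth re-checking first.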


Update: rebooting my instance made it work. I’m wondering if there’s a better solution?

@EDENP, has this issue persisted for you? I’m also using PyTorch 1.8.1 and experiencing this issue when submitting a distributed training job with 2 nodes, each having 4 GPUs. I eventually get the message: Timed out initializing process group in store based barrier on rank: 4, for key: store_based_barrier_key:1 (world_size=8, worker_count=2, timeout=0:30:00). Thanks for any help.


I’m also experiencing this exact same error when trying to run on a single node with 8 GPUs. I’ve tried rebooting my machine and setting export NCCL_IB_DISABLE=1, but neither has worked.

I am also experiencing this when trying to run on 2 nodes with 2 GPUs per node (PyTorch 1.9, CUDA 11.1, cuDNN). I have set export NCCL_IB_DISABLE=1 and NCCL_SOCKET_IFNAME=eth0, but those didn’t work. By the way, the same code works well on a single node with 2 GPUs.
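One thing worth double-checking in the multi-node case: world_size must be the total number of processes across all nodes, and rank must be globally unique, not the local GPU index. A quick sanity sketch for the 2-nodes × 2-GPUs setup above (the variable names are placeholders for however your launcher passes these values):

```python
# Hypothetical launcher values -- substitute your own.
nnodes = 2           # total number of machines in the job
gpus_per_node = 2    # processes (GPUs) per machine
node_rank = 1        # index of this machine, 0-based
local_rank = 0       # GPU index on this machine

# world_size counts every process in the whole job, not just this node's.
world_size = nnodes * gpus_per_node

# rank must be globally unique across all nodes.
rank = node_rank * gpus_per_node + local_rank

print("world_size =", world_size)  # 4
print("rank =", rank)              # 2
```

These are the values to pass to dist.init_process_group. If each node instead passed world_size=gpus_per_node (2 here), the store-based barrier would wait for the wrong worker count and time out exactly as in the errors above.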

I am having exactly the same issue when running distributed training with 2 nodes. Have you solved this problem?

Same problem, any updates?

Hello, have you found a suitable solution to this problem?

Any updates? Same problem here.

Same problem here. Would appreciate any updates!