Timed out initializing process group in store based barrier

I am trying to train a model with PyTorch 1.8.1, but I run into a problem that interrupts my training every 30 minutes (I have to restart the training from the checkpoint every 30 minutes):

...

    dist.init_process_group(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 525, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 212, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 6, for key: store_based_barrier_key:1 (world_size=1, worker_count=8, timeout=0:30:00)

(The tracebacks of several processes are interleaved in my log; ranks 5 and 7 raise the identical error:)

RuntimeError: Timed out initializing process group in store based barrier on rank: 7, for key: store_based_barrier_key:1 (world_size=1, worker_count=8, timeout=0:30:00)
RuntimeError: Timed out initializing process group in store based barrier on rank: 5, for key: store_based_barrier_key:1 (world_size=1, worker_count=8, timeout=0:30:00)

This seems to be related to the 'timeout' parameter, which defaults to 'default_pg_timeout' (1800 seconds) in the 'init_process_group' function, as suggested by the source code of 'torch.distributed.distributed_c10d'.

So I modified my code as:

from datetime import timedelta
import torch.distributed as dist

# default_pg_timeout is timedelta(seconds=1800)
timeout = timedelta(seconds=86400)
...

dist.init_process_group(
    backend=backend,                               # e.g. 'nccl'
    init_method='tcp://127.0.0.1:%d' % tcp_port,
    rank=local_rank,
    world_size=num_gpus,
    timeout=timeout,
)
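As an aside, the (world_size=1, worker_count=8) in the traceback suggests each process may be creating its own group with world_size=1 while 8 workers join the same store barrier. A minimal sketch of how rank and world_size are usually derived so they are consistent across processes (the helper name and arguments here are illustrative, not from my script):

```python
def compute_dist_info(node_rank, local_rank, nnodes, gpus_per_node):
    """Derive the global rank and total world size for one process.

    world_size must equal the TOTAL number of processes across all
    nodes; if each process passes world_size=1 while several workers
    join the store barrier, init_process_group times out with
    (world_size=1, worker_count=N) as in the traceback above.
    """
    world_size = nnodes * gpus_per_node
    rank = node_rank * gpus_per_node + local_rank
    return rank, world_size

# Single node with 8 GPUs, process with local_rank 6:
rank, world_size = compute_dist_info(node_rank=0, local_rank=6,
                                     nnodes=1, gpus_per_node=8)
# rank == 6, world_size == 8 (not 1)
```

These values would then be passed as the `rank` and `world_size` arguments of `dist.init_process_group`, so that every process agrees on the same world size.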

But it didn’t work (maybe?). Could someone help me, please? :sob:


Update: rebooting my instance made it work. Wondering if there’s a better solution?

@EDENP, has this issue persisted for you? I’m also using PyTorch 1.8.1 and experiencing this issue when submitting a distributed training job with 2 nodes, each having 4 GPUs. I eventually get the message: Timed out initializing process group in store based barrier on rank: 4, for key: store_based_barrier_key:1 (world_size=8, worker_count=2, timeout=0:30:00). Thanks for any help.


I’m also experiencing this exact same error when trying to run on a single node with 8 GPUs. I’ve tried rebooting my machine and setting export NCCL_IB_DISABLE=1, but neither of those has worked.

I am also experiencing this when trying to run on 2 nodes with 2 GPUs per node (PyTorch 1.9 + cuDNN, CUDA 11.1). I have set export NCCL_IB_DISABLE=1 and NCCL_SOCKET_IFNAME=eth0, but those didn’t work. By the way, they work well on a single node with 2 GPUs.
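For multi-node runs, this is a sketch of the rendezvous-related variables I’d double-check on every node before launching (the values are placeholders for your cluster, not known-good settings):

```shell
# Must be identical on every node:
export MASTER_ADDR=10.0.0.1      # an IP of the rank-0 node reachable from all nodes
export MASTER_PORT=29500         # a free port on the rank-0 node
# Per-node network settings:
export NCCL_SOCKET_IFNAME=eth0   # the NIC that actually routes between the nodes
export NCCL_DEBUG=INFO           # print NCCL setup/rendezvous logs to stderr
```

If the interface named in NCCL_SOCKET_IFNAME does not route between the nodes, the workers on one node never reach the store on the other, and the barrier times out exactly like this.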

I am having exactly the same issue when running distributed training with 2 nodes. Have you solved this problem?

Same problem, any updates?

Hello, have you found a suitable solution to this problem.

Any updates? Same problem here.

Same problem here. Would appreciate any updates!

I’m getting this exact same problem. Are there any updates for this?

I’m receiving the same error message.

Same issue here using 2 nodes and 3 GPUs per node.

Is there any solution to this? I am facing the same error with torch 1.11.0.

I am facing the same problem. Did anyone manage to solve it?

Same problem! Anyone from the developers? @ptrblck

I would recommend checking these docs and trying to debug the application with the env variables, etc.
Unfortunately, this topic does not contain enough information beyond users claiming they are running into the same issue, which is not actionable.
Let me know once you have more logging information that you could post here.
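To get that logging, something along these lines should surface more detail before the barrier times out (a sketch; `train.py` and the process count stand in for your own script and setup):

```shell
# More verbose logging from the distributed machinery (PyTorch >= 1.9)
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_CPP_LOG_LEVEL=INFO
# NCCL rendezvous / transport logs
export NCCL_DEBUG=INFO

python -m torch.distributed.run --nproc_per_node=8 train.py
```

The resulting logs usually show which rank never reached the barrier and on which interface the rendezvous was attempted, which is the kind of information that makes this debuggable.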