Timed out initializing process group in store based barrier

Update: I rebooted my instance and it works now; wondering if there’s a better solution?

@EDENP, has this issue persisted for you? I’m also using PyTorch 1.8.1 and experiencing this issue when submitting a distributed training job with 2 nodes, each having 4 GPUs. I eventually get the message: Timed out initializing process group in store based barrier on rank: 4, for key: store_based_barrier_key:1 (world_size=8, worker_count=2, timeout=0:30:00). Thanks for any help.
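
Not a fix, but for context, below is a minimal sketch of what the initialisation usually looks like on each of the 8 processes in a 2-node × 4-GPU job (assuming the launcher exports RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT, as torch.distributed.launch --use_env or torchrun do). The store-based barrier inside init_process_group() waits until world_size workers have checked in, so "worker_count=2" with "world_size=8" means 6 of the ranks never registered, usually because they crashed or could not reach the master address:

```python
import os
import torch.distributed as dist

# 2 nodes x 4 GPUs => world_size = 8.
# The store-based barrier waits until all 8 ranks have registered with the
# TCPStore; "worker_count=2" in the error means only 2 of them ever did.
rank = int(os.environ["RANK"])              # 0..7, set by the launcher
world_size = int(os.environ["WORLD_SIZE"])  # must be 8 on every process

dist.init_process_group(
    backend="nccl",
    init_method="env://",   # reads MASTER_ADDR / MASTER_PORT from the env
    rank=rank,
    world_size=world_size,
)
```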


I’m also experiencing this exact same error when trying to run on a single node with 8 GPUs. I’ve tried rebooting my machine and setting export NCCL_IB_DISABLE=1, but neither of those has worked.

I am also experiencing this when trying to run on 2 nodes with 2 GPUs per node (PyTorch 1.9, cuDNN, CUDA 11.1). I have set export NCCL_IB_DISABLE=1 and NCCL_SOCKET_IFNAME=eth0, but those didn’t work. By the way, the same setup works fine on a single node with 2 GPUs.

I am having exactly the same issue when running distributed training with 2 nodes. Have you solved this problem?

Same problem, any updates?

Hello, have you found a suitable solution to this problem?

Any updates? Same problem here.

Same problem here. Would appreciate any updates!

I’m getting this exact same problem. Are there any updates for this?

I’m receiving the same error message.

Same issue here, using 2 nodes with 3 GPUs per node.

Is there any solution to this? I am facing the same error with torch 1.11.0.

I am facing the same problem. Did anyone manage to solve it?

Same problem! Anyone from the developers? @ptrblck

I would recommend checking these docs and trying to debug the application with the env variables, etc.
Unfortunately, this topic does not contain enough information besides users claiming they are running into the same issue, which is not actionable.
Let me know once you have more logging information which you could post here.
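
For reference, a minimal sketch of the logging/debug environment variables described in the troubleshooting docs (TORCH_DISTRIBUTED_DEBUG requires PyTorch 1.9+; the values below are just typical starting points, and the same variables can instead be exported in the shell on every node before launching):

```python
import os
import torch.distributed as dist

# Enable verbose rendezvous/collective logging before init_process_group().
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra consistency checks
os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"        # c10d / TCPStore log output
os.environ["NCCL_DEBUG"] = "INFO"                 # NCCL transport logging
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"      # focus on init and network

# Assumes the launcher exports RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT.
dist.init_process_group(backend="nccl", init_method="env://")
```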

Sucks that nobody has an answer for this. I am getting the same error out of the blue on a single machine with multiple GPUs. I suspect it has something to do with memory, but I really don’t know how to track it down.


Same error here, but after setting the NCCL_SOCKET_IFNAME environment variable, the problem was solved.
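
For anyone finding this later, a hedged example of what that looks like; the exact value is machine-specific, and "eth0" below is only an assumption:

```python
import os

# Pin NCCL to the network interface that actually connects the nodes.
# "eth0" is an assumption -- check `ip addr` / `ifconfig` on every node and
# use the interface whose IP address the other nodes can reach.
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
# Equivalently, before launching:  export NCCL_SOCKET_IFNAME=eth0
```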


Hey, what did you set this variable to?

I had the same issue trying to process multiple models in parallel for model combination.

I fixed the problem by NEVER calling destroy_process_group(). Instead, once I have called init_process_group(…), I only check whether the process group has been initialised with torch.distributed.is_initialized().

I observe that once I destroy a process group, I cannot initialise it again, because I run into this timeout. (A short sketch of this pattern is below.)
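
A minimal sketch of that guard, assuming the default process group and an env:// rendezvous; the ensure_process_group name is purely illustrative:

```python
import torch.distributed as dist

def ensure_process_group(backend: str = "nccl") -> None:
    """Initialise the default process group once and then reuse it.

    The group is deliberately never torn down with destroy_process_group();
    re-initialising after a destroy is what appeared to trigger the
    store-based-barrier timeout here.
    """
    if not dist.is_initialized():
        # Assumes the launcher exports RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT.
        dist.init_process_group(backend=backend, init_method="env://")

# Call ensure_process_group() before every parallel job; subsequent calls
# are no-ops because is_initialized() is already True.
```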