Set longer timeout for torch distributed training

Hi, I wonder if there is a mechanism to synchronize all processes with unlimited waiting time.

The scenario is distributed training where one process on each node needs to handle some CPU-bound tasks while the other processes wait until it finishes. Currently I use torch.distributed.barrier() (with the NCCL backend) and find that it times out after half an hour. Is there another way to synchronize all processes without a timeout limit? It would be even better if torch already provided something like this.
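
For context, a rough sketch of the setup I mean (assuming a torchrun-style launch where LOCAL_RANK identifies the process within a node; do_cpu_preprocessing is just a placeholder for the CPU task):

```python
import os
import torch.distributed as dist

def do_cpu_preprocessing():
    ...  # placeholder for the node-local CPU-bound work

# Assumes a torchrun-style launch where LOCAL_RANK identifies
# the process within each node.
dist.init_process_group(backend="nccl")

if int(os.environ["LOCAL_RANK"]) == 0:
    do_cpu_preprocessing()  # one process per node does the CPU work

# With NCCL's default timeout (30 minutes), this barrier fails
# if the CPU work above takes longer than that.
dist.barrier()
```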


Could you try torch.distributed.monitored_barrier (Distributed communication package - torch.distributed — PyTorch 1.13 documentation) instead?
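
A minimal sketch of how that can be used, assuming a Gloo process group (the two-hour timeout is just an example):

```python
import datetime
import torch.distributed as dist

# monitored_barrier is only available on the Gloo backend.
dist.init_process_group(backend="gloo")

# Each rank blocks here; the per-call timeout can be set as long as needed.
dist.monitored_barrier(timeout=datetime.timedelta(hours=2))
```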


Worked for me! Thank you!

This is not a solution to the question. monitored_barrier is only implemented for the Gloo backend, whereas the question specifically states they're using NCCL.


Hello,
I tried this and it really worked. Thanks for sharing.

Hello, I found that init_process_group works for my scenario. You can customize the timeout with the timeout argument; all NCCL operations in that process group will then use this timeout.
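
Roughly like this, in case it helps anyone (the two-hour value is just an example):

```python
import datetime
import torch.distributed as dist

# The timeout passed here applies to all collectives in this process group,
# including dist.barrier(), instead of NCCL's 30-minute default.
# (Depending on the PyTorch version, NCCL only enforces it when async error
# handling / blocking wait is enabled.)
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)
```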
