Set longer timeout for torch distributed training

Hi, I wonder if there is a mechanism to synchronize all processes with an unlimited waiting time.

The scenario is distributed training where one process on each node needs to handle some CPU-bound tasks while the other processes wait until it finishes. Currently I use torch.distributed.barrier() (with the nccl backend) and find that it times out after half an hour. Is there another way to synchronize all processes without a timeout limit? It would be even better if torch already provides something like this.
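A minimal sketch of the setup described above (the LOCAL_RANK environment variable and the placeholder sleep standing in for the CPU task are assumptions, not from the original post): one process per node does long CPU work while the remaining ranks block at a barrier, which fails once the default 30-minute timeout expires.

```python
import os
import time

import torch.distributed as dist


def main():
    # Default process-group timeout is 30 minutes.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    if local_rank == 0:
        time.sleep(3600)  # placeholder for a long-running CPU-bound task
    # Other ranks wait here and hit the 30-minute limit before rank 0 arrives.
    dist.barrier()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```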


Could you try increasing the `timeout` argument of `init_process_group` (see Distributed communication package - torch.distributed — PyTorch 1.13 documentation) instead?
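A hedged sketch of that suggestion: `torch.distributed.init_process_group()` accepts a `timeout` argument (a `datetime.timedelta`) that raises the default 30-minute limit; the two-hour value below is only an example. Note that, per the linked documentation, with the NCCL backend the timeout is only enforced when blocking wait or async error handling is enabled via the corresponding environment variables.

```python
import datetime

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    # Raise the default 30-minute limit; pick a value longer than the CPU task.
    timeout=datetime.timedelta(hours=2),
)
```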


Worked for me! Thank you!