Hi, I wonder if there is a mechanism to synchronize all processes with an unlimited waiting time.
The scenario is distributed training where one process on each node needs to handle some CPU-related tasks, while the other processes keep waiting until it finishes. Currently I use torch.distributed.barrier() (with the NCCL backend) and find that it times out after half an hour. Is there another way to synchronize all processes without a timeout limit? It would be best if torch already provides something like this.
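
For reference, here is a minimal sketch of the pattern I mean, together with the only workaround I am aware of so far: passing a larger `timeout` to `torch.distributed.init_process_group`. The helper names `init_with_long_timeout` and `run_cpu_task_then_sync`, the 24-hour value, and the default address and port are placeholders I made up for illustration. I use the gloo backend here so the snippet runs on a CPU-only machine; my real setup uses nccl, and I have not verified that the timeout is applied the same way there.

```python
import datetime
import os

import torch.distributed as dist


def init_with_long_timeout(rank: int, world_size: int, hours: int = 24) -> None:
    """Create the default process group with a longer collective timeout.

    Hypothetical helper: the 24-hour default is arbitrary, not a recommendation.
    """
    # Placeholder rendezvous settings; a launcher such as torchrun normally sets these.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(
        backend="gloo",  # assumption: swap in "nccl" on GPU nodes
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(hours=hours),
    )


def run_cpu_task_then_sync(does_cpu_work: bool, task) -> None:
    """Run `task` on the designated process only, then wait on a barrier.

    Every process must call this function, otherwise the barrier never completes.
    """
    if does_cpu_work:
        task()
    dist.barrier()
```

In my case each process would call `init_with_long_timeout(rank, world_size)` once at startup and then `run_cpu_task_then_sync(local_rank == 0, my_cpu_task)`. This only raises the limit rather than removing it, so I am still interested in a barrier with no timeout at all.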