Set longer timeout for torch distributed training

Hi, I wonder if there is a mechanism to synchronize all processes with unlimited waiting time.

The scenario is distributed training where one process on each node needs to handle some CPU-bound tasks while the other processes wait until it finishes. Currently I use torch.distributed.barrier() (with the NCCL backend) and find that it times out after half an hour. Is there another way to synchronize all processes without a timeout limit? It would be even better if torch already provided something like this.
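
For context, a rough sketch of the setup I mean (assuming a torchrun-style launch where LOCAL_RANK identifies the process within a node; do_cpu_preprocessing is just a placeholder for the CPU task):

```python
import os
import torch.distributed as dist

def do_cpu_preprocessing():
    ...  # placeholder for the node-local CPU-bound work

# Assumes a torchrun-style launch where LOCAL_RANK identifies
# the process within each node.
dist.init_process_group(backend="nccl")

if int(os.environ["LOCAL_RANK"]) == 0:
    do_cpu_preprocessing()  # one process per node does the CPU work

# With NCCL's default timeout (30 minutes), this barrier fails
# if the CPU work above takes longer than that.
dist.barrier()
```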


Could you try torch.distributed.monitored_barrier (Distributed communication package - torch.distributed — PyTorch 1.13 documentation) instead?
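
A minimal sketch of how that can be used, assuming a Gloo process group (the two-hour timeout is just an example):

```python
import datetime
import torch.distributed as dist

# monitored_barrier is only available on the Gloo backend.
dist.init_process_group(backend="gloo")

# Each rank blocks here; the per-call timeout can be set as long as needed.
dist.monitored_barrier(timeout=datetime.timedelta(hours=2))
```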


Worked for me! Thank you!

This is not a solution to the question. monitored_barrier is only implemented for the Gloo backend, whereas the question specifically states they're using NCCL.


Hello,
I tried this and it really worked. Thanks for sharing.

Hello, I found that init_process_group works for my scenario. You can customize the timeout with the timeout argument; all NCCL operations in that process group will then use this timeout.
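
Roughly like this, in case it helps anyone (the two-hour value is just an example):

```python
import datetime
import torch.distributed as dist

# The timeout passed here applies to all collectives in this process group,
# including dist.barrier(), instead of NCCL's 30-minute default.
# (Depending on the PyTorch version, NCCL only enforces it when async error
# handling / blocking wait is enabled.)
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)
```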
