Hi,
I use torch elastic to run jobs with multiple workers. I have been seeing cases where a worker restarts and rejoins the group, but quickly hits RendezvousClosedError, which causes it to fail. The rest of the group is still running without problems, but because this one node failed, the whole job fails.
I launch the job like this:
torch.distributed.run --nproc_per_node=1 --node_rank=0 --nnodes=8:8 --master_addr=10.8.68.146 --master_port=2222 --rdzv_endpoint=10.8.68.146:2222 --rdzv_id=0 --rdzv_backend=c10d --rdzv_conf=join_timeout=3600,read_timeout=3600
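In case it helps to see the settings spelled out, here is a minimal Python sketch of how I read the --nnodes and --rdzv_conf values from the command above. The helper names parse_nnodes and parse_rdzv_conf are my own, for illustration only. They are not part of torch, and this is not how torchrun itself parses its arguments.

```python
from typing import Dict, Tuple


def parse_nnodes(value: str) -> Tuple[int, int]:
    """Split an --nnodes value such as "8:8" or "8" into (min_nodes, max_nodes).

    Hypothetical helper for illustration; not a torch API.
    """
    parts = value.split(":")
    if len(parts) == 1:
        n = int(parts[0])
        return n, n
    if len(parts) == 2:
        return int(parts[0]), int(parts[1])
    raise ValueError(f"expected N or MIN:MAX, got {value!r}")


def parse_rdzv_conf(value: str) -> Dict[str, str]:
    """Split an --rdzv_conf value of comma-separated key=value pairs into a dict.

    Hypothetical helper for illustration; not a torch API.
    """
    conf: Dict[str, str] = {}
    for pair in value.split(","):
        key, _, val = pair.partition("=")
        conf[key.strip()] = val.strip()
    return conf


# Values copied from the command line in this post.
min_nodes, max_nodes = parse_nnodes("8:8")
rdzv_conf = parse_rdzv_conf("join_timeout=3600,read_timeout=3600")

print(min_nodes, max_nodes)
print(rdzv_conf)
```

So the minimum and maximum node counts are both 8, and both rendezvous timeouts are set to 3600 seconds.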
I would appreciate any pointers on the setup. Thanks!