`RendezvousClosedError` when a node restarts and rejoins the group

Hi,

I use torch elastic to run jobs with multiple workers. I have been seeing cases where a worker restarts and rejoins the group, but quickly hits `RendezvousClosedError` and fails. The rest of the group keeps running without problems, but because this one node failed, the whole job fails.

I launch the job with torchrun like this: `torch.distributed.run --nproc_per_node=1 --node_rank=0 --nnodes=8:8 --master_addr=10.8.68.146 --master_port=2222 --rdzv_endpoint=10.8.68.146:2222 --rdzv_id=0 --rdzv_backend=c10d --rdzv_conf=join_timeout=3600,read_timeout=3600`

I would appreciate any pointers on the setup. Thanks!

I assume that nnodes=8:8 means the min and max are both 8 nodes, so the group tolerates no failures. If one of the nodes fails, the whole group must restart. Shouldn’t the number of restarts be >0 so that it tolerates some failures?
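
To make the min:max reading concrete, here is a rough sketch of the arithmetic. The helpers `parse_nnodes` and `spare_nodes` are hypothetical names made up for illustration, not torchrun's own code, and the sketch only reasons about the `MIN:MAX` string. It says nothing about how the rendezvous backend actually behaves when a node drops.

```python
from typing import Tuple


def parse_nnodes(spec: str) -> Tuple[int, int]:
    """Split an nnodes spec such as "8" or "4:8" into (min_nodes, max_nodes).

    Hypothetical helper for illustration; it only mirrors the MIN:MAX format.
    """
    parts = spec.split(":")
    if len(parts) == 1:
        # A single number means a fixed-size group: min == max.
        min_nodes = max_nodes = int(parts[0])
    elif len(parts) == 2:
        min_nodes, max_nodes = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"expected N or MIN:MAX, got {spec!r}")
    if not 0 < min_nodes <= max_nodes:
        raise ValueError(f"need 0 < min <= max, got {spec!r}")
    return min_nodes, max_nodes


def spare_nodes(spec: str) -> int:
    """Nodes a full-size group could lose and still meet the minimum (assumed reading)."""
    min_nodes, max_nodes = parse_nnodes(spec)
    return max_nodes - min_nodes


print(spare_nodes("8:8"))  # 0
print(spare_nodes("4:8"))  # 4
```

Under that reading, 8:8 leaves no slack at all, while a spec like 4:8 would leave room for the group to shrink. Whether a restarted node can rejoin is a separate question that depends on the restart setting and the rendezvous state.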