When running FSDP or DDP training on a multi-node NVIDIA DGX cluster with 8 GPUs per node, the standard DGX network configuration gives each node 8 backend network interfaces for GPU-to-GPU traffic, plus 2 additional frontend network interfaces.
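For context, this is roughly how the backend/frontend split is exposed to NCCL in my setup. The interface names below are illustrative, not copied from an actual node (check `ibstat` / `ip link` for the real ones):

```shell
# Backend: one InfiniBand HCA per GPU, used for NCCL collectives.
# Names are hypothetical examples; DGX OS releases may differ.
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7

# Frontend: Ethernet interfaces used for rendezvous/bootstrap traffic only.
export NCCL_SOCKET_IFNAME=eth0,eth1

# Verbose NCCL logging, useful when diagnosing link problems.
export NCCL_DEBUG=INFO
```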
Is it possible for NCCL or PyTorch Distributed to route around a link failure on one of the backend network interfaces? Or does such a failure require restarting the job from a saved checkpoint?
I have been unable to find relevant documentation on fault tolerance for network link failures, specifically recovery or re-routing of traffic around a failed link. Pointers would be greatly appreciated.