Which of these would be the best option for managing torchrun across nodes and ensuring fault tolerance in Fully Sharded Data Parallel (FSDP) training:
- Kubernetes
- Ray
- Slurm
I’m looking for the approach that aligns best with current industry practice for scalability, fault tolerance, and speed in distributed deep learning. On the speed point, for instance, I’ve read that Kubernetes can add roughly 2% overhead from containerization, whereas Slurm runs jobs directly on the hosts and avoids that cost.
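
For context, here is roughly how I launch the job today using torchrun's elastic mode (a minimal sketch; the node counts, rendezvous endpoint, and script name are placeholders rather than my actual setup):

```bash
# Run on every node; the c10d rendezvous coordinates worker joins, and
# torchrun restarts the local workers up to --max_restarts times on failure.
torchrun \
  --nnodes=2:4 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=head-node:29500 \
  --rdzv_id=fsdp_job \
  --max_restarts=3 \
  train_fsdp.py
```

So the question is really about which system should be responsible for starting this command on each node, replacing failed nodes, and resubmitting the job when something goes wrong.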