Can anyone share their experience with the network bandwidth needed to scale with DDP? Is 25 GbE enough, or do we need 40 GbE, 50 GbE, 100 GbE, or 200 GbE? Our target is a small cluster of 4 nodes with 8 GPUs.
Is GPUDirect RDMA a must have for our setup?
What about GPUDirect Storage? Is it needed?
Are there any other considerations for scaling well with PyTorch DDP?
For such a small cluster I think 25 GbE should be enough, but it depends on the size of your model, since the parameter gradients have to be synchronized across all processes on every training step. GPUDirect RDMA and GPUDirect Storage are not strictly required to use DDP. The best approach is to measure the time spent in the DDP forward pass, backward pass, and optimizer step, find the bottleneck, and optimize from there.
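To get a feel for how model size and link speed interact, here is a rough back-of-the-envelope sketch. It is not a measurement and not how DDP or NCCL actually schedules traffic. It assumes a ring all-reduce in which each node counts as one participant, fp32 gradients (4 bytes per parameter), a link that runs at its full nominal rate, and no overlap of communication with the backward pass. The function name `allreduce_seconds` and the 100M-parameter model are made up for illustration.

```python
def allreduce_seconds(num_params, link_gbit_per_s, num_nodes=4, bytes_per_param=4):
    """Rough estimate of the time for one ring all-reduce of all gradients.

    Assumptions (not measured): each node is one ring participant, the link
    delivers its full nominal bandwidth, and latency is ignored.
    """
    grad_bytes = num_params * bytes_per_param
    # In a ring all-reduce every participant sends (and receives)
    # 2 * (N - 1) / N times the size of the buffer being reduced.
    traffic_bytes = 2 * (num_nodes - 1) / num_nodes * grad_bytes
    link_bytes_per_s = link_gbit_per_s * 1e9 / 8  # Gbit/s -> bytes/s
    return traffic_bytes / link_bytes_per_s


if __name__ == "__main__":
    # Hypothetical 100M-parameter model on the link speeds from the question.
    for gbit in (25, 40, 50, 100, 200):
        t = allreduce_seconds(100_000_000, gbit)
        print(f"{gbit:>3} GbE: about {t * 1e3:.0f} ms of gradient traffic per step")
```

Under these assumptions a 100M-parameter model needs roughly 0.19 s of gradient traffic per step on 25 GbE. If your measured backward pass takes much longer than the estimate for your model, the network is unlikely to be the bottleneck. If it is shorter, a faster interconnect is probably worth considering. DDP overlaps communication with the backward pass, so the real cost can be lower than this estimate suggests.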