Can InfiniBand accelerate distributed training without GPUDirect?

GeoffreyChen777 · May 11, 2019, 9:37am

I have two machines, each with 4x 2080 Ti GPUs. I want to train my model with the NCCL distributed backend, but training is slow because the two machines are connected by 1000 Mbit/s Ethernet cards.
So I want to connect the two machines with two InfiniBand cards instead.
But my GPUs are GeForce cards, not Tesla. The question is: can InfiniBand accelerate the training if the GPUs don’t support GPUDirect?

Thanks.

pietern · May 21, 2019, 3:53pm

In theory, yes. As long as you get cards with a higher bandwidth than your Ethernet setup, it should result in an improvement. But since NCCL is built to use GPUDirect, I’m not sure if it will work with NCCL out of the box. If it doesn’t, you could experiment with IPoIB (IP over InfiniBand) and fall back to NCCL’s TCP transport.
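If you end up going the IPoIB route, a minimal sketch of the setup could look like the following. It only uses NCCL's documented environment variables (`NCCL_IB_DISABLE`, `NCCL_SOCKET_IFNAME`, `NCCL_DEBUG`). The interface name `ib0` and the helper names `nccl_ipoib_env` and `init_nccl_over_ipoib` are assumptions made for illustration, so check your actual interface name with `ip addr`. I haven't tested this on GeForce hardware.

```python
import os


def nccl_ipoib_env(ifname="ib0", disable_ib_verbs=True, debug=True):
    """Build NCCL environment variables for TCP sockets over an IPoIB interface.

    `ifname` is an assumption: "ib0" is a common name for the first IPoIB
    interface on Linux, but yours may differ.
    """
    # Tell NCCL which network interface to use for socket traffic.
    env = {"NCCL_SOCKET_IFNAME": ifname}
    if disable_ib_verbs:
        # Disable NCCL's native IB transport so it falls back to IP sockets.
        env["NCCL_IB_DISABLE"] = "1"
    if debug:
        # Make NCCL log which transport and interface it actually selected.
        env["NCCL_DEBUG"] = "INFO"
    return env


def init_nccl_over_ipoib(ifname="ib0"):
    """Hypothetical helper: set the variables, then start the process group.

    Expects MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE to be set already
    (for example by your launcher), with MASTER_ADDR pointing at the IPoIB
    address of the rank 0 machine.
    """
    os.environ.update(nccl_ipoib_env(ifname))
    import torch.distributed as dist  # imported lazily; requires PyTorch

    dist.init_process_group(backend="nccl", init_method="env://")
```

Call `init_nccl_over_ipoib()` at the start of the training script on every process. With `NCCL_DEBUG=INFO` set, the NCCL log should show whether it picked the socket transport on the InfiniBand interface. It is also worth trying once with `disable_ib_verbs=False` to see whether NCCL's native IB transport works on your cards.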

Good luck!