Multi-Node Multi-GPU - How to distribute model training on multi-node with multi-gpu?

bkuriach · June 3, 2021, 7:39pm

I have 3 nodes, each with 2 GPUs. How can I distribute my model training across them? Does torch.dist.distributedparallel (or a similar torch library) distribute training across multiple nodes with multiple GPUs? If not, what is the best alternative?

eqy · June 3, 2021, 7:49pm

Yes, this is the purpose of DistributedDataParallel (see the PyTorch master documentation).
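For reference, here is a minimal sketch of what a DistributedDataParallel training script could look like for this kind of setup. It is not taken from the linked documentation: the toy `nn.Linear` model, the random data, the `train` function name, and the fallback to the `gloo` backend on machines without CUDA are all assumptions made for illustration. The script expects one process per GPU and reads the `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` environment variables that the PyTorch launcher sets.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def train(steps: int = 3) -> float:
    # The launcher sets these for every process it starts. The defaults
    # (an assumption for this sketch) allow a single-process smoke test.
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    use_cuda = torch.cuda.is_available()
    dist.init_process_group(
        backend="nccl" if use_cuda else "gloo",
        rank=rank,
        world_size=world_size,
    )
    try:
        if use_cuda:
            # Bind this process to its own GPU on the node.
            torch.cuda.set_device(local_rank)
            device = torch.device("cuda", local_rank)
        else:
            device = torch.device("cpu")

        model = nn.Linear(10, 1).to(device)  # toy model, for illustration only
        ddp_model = DDP(model, device_ids=[local_rank] if use_cuda else None)
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
        loss_fn = nn.MSELoss()

        loss = torch.tensor(0.0)
        for _ in range(steps):
            # Random data stands in for a real DataLoader + DistributedSampler.
            inputs = torch.randn(32, 10, device=device)
            targets = torch.randn(32, 1, device=device)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(inputs), targets)
            loss.backward()  # DDP averages gradients across processes here
            optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(f"final loss: {train():.4f}")
```

With 3 nodes and 2 GPUs each, one way to launch this on recent PyTorch versions would be to run `torchrun --nnodes=3 --nproc_per_node=2 --node_rank=<0|1|2> --master_addr=<host of node 0> --master_port=29500 train.py` once on every node, where `train.py` is a hypothetical file name for the script above. That gives 6 processes in total, one per GPU. For real training you would also want a `DistributedSampler` so that each process sees a different shard of the dataset.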

bkuriach · June 3, 2021, 7:56pm

@eqy I have also heard about Horovod. Does it do the same thing? What is the best choice for the above scenario? Thank you very much for your response!!

mrshenli · June 6, 2021, 8:28pm

I have also heard about Horovod. Does it do the same thing? What is the best choice for the above scenario?

Hey @bkuriach. It depends. If you would like one package that works across frameworks (PyTorch/TensorFlow), the Horovod distributed package might be a better fit. But if you are already using PyTorch, PyTorch DDP might be a better fit. Quoting my own response from another post:

One difference between PyTorch DDP and Horovod+PyTorch is that DDP overlaps backward computation with communication. In contrast, according to the following example, Horovod synchronizes models in the optimizer's step(), which cannot overlap with backward computation. So, in theory, DDP should be faster.
Horovod with PyTorch — Horovod documentation
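To make the "in theory" part concrete, here is a toy timing model of one training step. It is only a sketch under idealized assumptions: gradients are split into buckets, each bucket has a fixed backward time and a fixed all-reduce time, communication runs on one channel, and overlapping costs nothing extra. The function names and the millisecond figures are made up for illustration and are not measurements of either library.

```python
def step_time_sequential(backward_ms, comm_ms):
    """All communication starts only after the whole backward pass is done."""
    return sum(backward_ms) + sum(comm_ms)


def step_time_overlapped(backward_ms, comm_ms):
    """Each bucket's all-reduce starts once its gradients are ready and the
    previous all-reduce has finished, while backward keeps running."""
    grads_ready = 0.0  # time at which the current bucket's gradients exist
    comm_done = 0.0    # time at which the communication channel is free again
    for compute, comm in zip(backward_ms, comm_ms):
        grads_ready += compute
        comm_done = max(grads_ready, comm_done) + comm
    return comm_done


# Hypothetical per-bucket costs in milliseconds.
backward_ms = [4.0, 4.0, 4.0, 4.0]
comm_ms = [3.0, 3.0, 3.0, 3.0]

print(step_time_sequential(backward_ms, comm_ms))  # 28.0
print(step_time_overlapped(backward_ms, comm_ms))  # 19.0
```

In this model, overlapping hides all communication except the last bucket's all-reduce. The real gap depends on network bandwidth, bucket sizes, and model shape, so it is worth benchmarking on your own cluster.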