I have 3 nodes, each with 2 GPUs. How can I distribute my model training? Does torch.nn.parallel.DistributedDataParallel (or a similar torch library) distribute training across multiple nodes with multiple GPUs? If not, what is the best alternative?
@eqy I also heard about Horovod. Does it do the same thing? What is the best choice for the above scenario? Thank you very much for your response!
Hey @bkuriach. It depends. If you would like framework flexibility (PyTorch/TensorFlow), Horovod's distributed package might be a better fit. But if you are already using PyTorch, PyTorch DDP might be the better choice. Quoting my own response from another post:
One difference between PyTorch DDP and Horovod+PyTorch is that DDP overlaps backward computation with communication. In contrast, according to the following example, Horovod synchronizes models in the optimizer step(), which won't be able to overlap with backward computation. So, in theory, DDP should be faster.
Horovod with PyTorch - Horovod documentation
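To make the DDP option concrete, here is a minimal sketch of a multi-node, multi-GPU training step. It assumes a recent PyTorch with the torchrun launcher, which sets RANK, WORLD_SIZE, and LOCAL_RANK for each process; the tiny Linear model, the loss, and the hyperparameters are placeholders. The defaults let it also run as a single CPU process (Gloo backend) for a quick smoke test.

```python
# Minimal DDP sketch. Under torchrun, RANK / WORLD_SIZE / LOCAL_RANK are
# set automatically; the defaults below allow a single-process CPU run.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

# NCCL for GPU training, Gloo as a CPU fallback.
backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend, rank=rank, world_size=world_size)

if torch.cuda.is_available():
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    model = nn.Linear(10, 1).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
else:
    ddp_model = DDP(nn.Linear(10, 1))

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

# One training step: DDP all-reduces gradients during backward(),
# overlapping communication with the backward computation.
inputs = torch.randn(4, 10)
targets = torch.randn(4, 1)
if torch.cuda.is_available():
    inputs, targets = inputs.cuda(), targets.cuda()
loss = nn.functional.mse_loss(ddp_model(inputs), targets)
loss.backward()
optimizer.step()

dist.destroy_process_group()
```

For your 3-node x 2-GPU setup, you would launch this on every node with something like: torchrun --nnodes=3 --nproc_per_node=2 --node_rank=N --master_addr=NODE0_IP --master_port=29500 train.py (N being 0, 1, or 2), giving 6 processes total, one per GPU.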