What is the key difference between torch.nn.parallel.DistributedDataParallel and Horovod?
If my understanding is correct, DistributedDataParallel works on a single node with one or more GPUs (it does not distribute workloads across GPUs on more than one node), whereas Horovod can work with multiple nodes and multiple GPUs.
If my understanding is not correct, could you explain when to use Horovod and when to use DistributedDataParallel?
Kindly share your thoughts. Thank you very much in advance!
As stated in the DDP docs, DistributedDataParallel is able to use multiple machines:
DistributedDataParallel (DDP) implements data parallelism at the module level which can run across multiple machines. Applications using DDP should spawn multiple processes and create a single DDP instance per process.
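To make the "one process, one DDP instance" idea concrete, here is a minimal sketch, not a production setup. It assumes the CPU-friendly `gloo` backend and a file-based rendezvous so it can run on a single machine. The toy `nn.Linear` model and the helper name `train_one_step` are hypothetical. On a real multi-node job, a launcher such as `torchrun` would normally start one such process per GPU on every machine and supply the rank and world size.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def train_one_step(rank: int, world_size: int, init_file: str) -> float:
    """Body of ONE process: join the group, wrap the model in DDP, take one step."""
    dist.init_process_group(
        backend="gloo",                     # assumption: CPU demo; "nccl" is usual on GPUs
        init_method=f"file://{init_file}",  # assumption: shared file used as rendezvous
        rank=rank,
        world_size=world_size,
    )
    try:
        torch.manual_seed(0)
        model = DDP(nn.Linear(4, 1))  # a single DDP instance in this process
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

        inputs, targets = torch.randn(8, 4), torch.randn(8, 1)
        loss = nn.functional.mse_loss(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()   # DDP all-reduces gradients across processes during backward
        optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    # Single-process demo (world_size=1) so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        print(train_one_step(rank=0, world_size=1, init_file=os.path.join(tmp, "ddp_init")))
```

The same function body would run unchanged with a larger `world_size`. Only the rendezvous settings and the number of launched processes change when more machines are added.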
I’m not familiar with Horovod and don’t know what its advantages might be.
PS: please don’t tag specific users, as it might discourage others from posting better answers.
@ptrblck Thank you very much for your response!!
One difference between PyTorch DDP and Horovod+PyTorch is that DDP overlaps backward computation with communication. In contrast, according to the following example, Horovod synchronizes models in the optimizer step(), which won’t be able to overlap with backward computations. So, in theory, DDP should be faster.
I don’t think so. Horovod can register a hook on each parameter’s gradient that launches an asynchronous communication call to synchronize that gradient. Each call returns a handle, and optimizer.step() only waits on those handles, so the communication does overlap with the backward pass.
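The hook-and-handle pattern described here can be sketched with plain `torch.distributed`. This is an illustration of the idea only, not Horovod's actual code. The wrapper class `OverlappingOptimizer` and the helper `run_demo` are hypothetical names, and the sketch assumes gradients are not accumulated over several backward passes. In Horovod itself the corresponding pieces would be `hvd.DistributedOptimizer`, `hvd.allreduce_async_` and `hvd.synchronize`.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn


class OverlappingOptimizer:
    """Hypothetical wrapper: start an async all-reduce per gradient from a hook."""

    def __init__(self, optimizer, params):
        self.optimizer = optimizer
        self._pending = []  # (parameter, buffer, async work handle)
        for p in params:
            if p.requires_grad:
                p.register_hook(self._make_hook(p))

    def _make_hook(self, p):
        def hook(grad):
            # Fires during backward() as soon as this gradient is ready, so the
            # all-reduce runs while autograd keeps computing earlier layers.
            buf = grad.detach().clone()
            work = dist.all_reduce(buf, op=dist.ReduceOp.SUM, async_op=True)
            self._pending.append((p, buf, work))
        return hook

    def synchronize(self):
        world_size = dist.get_world_size()
        for p, buf, work in self._pending:
            work.wait()                # block until this all-reduce has finished
            p.grad = buf / world_size  # replace the local gradient with the average
        self._pending.clear()

    def zero_grad(self):
        self.optimizer.zero_grad()

    def step(self):
        self.synchronize()  # the only blocking point, reached after backward()
        self.optimizer.step()


def run_demo(init_file: str):
    """Single-process demo; returns (handles launched, handles left, params changed)."""
    dist.init_process_group(
        backend="gloo", init_method=f"file://{init_file}", rank=0, world_size=1
    )
    try:
        torch.manual_seed(0)
        model = nn.Linear(4, 1)  # toy model: one weight and one bias
        params = list(model.parameters())
        opt = OverlappingOptimizer(torch.optim.SGD(params, lr=0.1), params)
        before = [p.detach().clone() for p in params]

        opt.zero_grad()
        model(torch.randn(8, 4)).pow(2).mean().backward()  # hooks launch all-reduces
        launched = len(opt._pending)
        opt.step()                                         # waits on handles, then updates
        changed = all(not torch.equal(b, p) for b, p in zip(before, params))
        return launched, len(opt._pending), changed
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_demo(os.path.join(tmp, "hook_init")))
```

The design point is that `step()` is where the code waits, not where communication starts. The transfers are issued earlier, from inside the backward pass, which is why placing the synchronization in `step()` does not by itself rule out overlap.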