Torch.dist.distributedparallel vs horovod

bkuriach · June 3, 2021, 7:32pm

What is the key difference between torch.dist.distributedparallel and horovod?
If my understanding is correct, torch.dist.distributedparallel works on a single node with one or more GPUs (it does not distribute workloads across GPUs on more than one node), whereas horovod can work with multi-node, multi-GPU setups.

If my understanding is not correct, kindly explain when to use horovod and when to use torch.dist.distributedparallel?

Kindly share your thoughts? Thank you very much in advance!!

ptrblck · June 3, 2021, 9:58pm

As given in the DDP docs, DistributedDataParallel is able to use multiple machines:

DistributedDataParallel (DDP) implements data parallelism at the module level which can run across multiple machines. Applications using DDP should spawn multiple processes and create a single DDP instance per process.
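To make the "one DDP instance per process" point concrete, here is a minimal sketch of a single training step. It assumes the CPU-only `gloo` backend and a world size of 1 so that it runs on any machine; `run_one_step`, the tiny `nn.Linear` model, and the rendezvous address and port are illustrative choices, not anything prescribed by the docs. In a real multi-machine job a launcher such as `torchrun` would start one such process per GPU on every node and supply the rank, world size, and master address.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run_one_step(rank: int = 0, world_size: int = 1) -> float:
    """Run one DDP training step in the current process and return the loss."""
    # Rendezvous settings (assumed values); a launcher normally sets these.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    try:
        model = DDP(nn.Linear(4, 2))  # a single DDP instance in this process
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        inputs, targets = torch.randn(8, 4), torch.randn(8, 2)

        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(inputs), targets)
        loss.backward()   # DDP all-reduces gradients during this call
        optimizer.step()  # every process applies the same averaged gradients
        return loss.item()
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(f"loss after one step: {run_one_step():.4f}")
```

The same function works unchanged across machines as long as every process is given a distinct `rank`, the shared `world_size`, and a `MASTER_ADDR` that all nodes can reach.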

I’m not familiar with horovod and don’t know what the advantages might be.

PS: please don’t tag specific users, as it might discourage others from posting better answers.

bkuriach · June 4, 2021, 1:32am

@ptrblck Thank you very much for your response!!

mrshenli · June 6, 2021, 8:21pm

One difference between PyTorch DDP and Horovod+PyTorch is that DDP overlaps backward computation with communication. In contrast, according to the following example, Horovod synchronizes models in the optimizer’s step(), which would not be able to overlap with backward computation. So, in theory, DDP should be faster.

https://horovod.readthedocs.io/en/stable/pytorch.html

ducviet00 · November 3, 2021, 5:55pm

I don’t think so. Horovod registers hooks on each parameter.grad that launch asynchronous communication to synchronize gradients as soon as they are ready during the backward pass. Each hook returns a handle to its async operation, and optimizer.step() only waits on those handles, so the communication does overlap with the backward computation.
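The hook-and-handle pattern described above can be illustrated with a toy, standard-library-only simulation. This is not Horovod's code: `ToyDistributedOptimizer`, `fake_allreduce`, and `grad_hook` are hypothetical names, a thread pool stands in for the communication backend, and the "gradients" are plain floats supplied by hand. The point is only the control flow: communication starts when each gradient becomes ready, and `step()` merely waits for it to finish.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, Sequence


def fake_allreduce(grads_per_worker: Sequence[float]) -> float:
    """Stand-in for a network allreduce: average one gradient across workers."""
    time.sleep(0.01)  # pretend this is communication latency
    return sum(grads_per_worker) / len(grads_per_worker)


class ToyDistributedOptimizer:
    """Toy SGD whose gradient averaging is started early and awaited in step()."""

    def __init__(self, params: Dict[str, float], lr: float) -> None:
        self.params = params
        self.lr = lr
        self._pool = ThreadPoolExecutor(max_workers=4)
        self._handles: Dict[str, Future] = {}

    def grad_hook(self, name: str, grads_per_worker: Sequence[float]) -> None:
        # Called "during backward" as soon as the gradient for `name` is ready.
        # submit() returns immediately, so backward can move on to the next layer.
        self._handles[name] = self._pool.submit(fake_allreduce, grads_per_worker)

    def step(self) -> None:
        for name, handle in self._handles.items():
            averaged = handle.result()  # synchronize: block until allreduce is done
            self.params[name] -= self.lr * averaged
        self._handles.clear()


if __name__ == "__main__":
    opt = ToyDistributedOptimizer({"w1": 1.0, "w2": 1.0}, lr=0.1)
    # Backward visits layers in reverse order; each hook fires as it goes.
    opt.grad_hook("w2", [0.2, 0.4])
    opt.grad_hook("w1", [1.0, 3.0])
    opt.step()
    print(opt.params)
```

Because the allreduce for `w2` is already in flight while the gradient for `w1` is still being "computed", most of the communication time is hidden behind the backward pass, and `step()` only pays for whatever has not yet completed.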

irasit · August 26, 2022, 8:30pm

Here is some comparison from horovod:

whatdhack · February 21, 2024, 8:24pm

Is there a DDP control plane like the one in Horovod? Where can I find end-to-end architectural documentation for DDP?