Reasons why Horovod is much faster than DDP

Using DDP in my code is 3x slower than using Horovod, for both single- and multi-GPU use cases. I’m trying to understand whether this comes from an inherent design difference between the two frameworks, or whether I broke something in my DDP setup.

Observations:

  1. DDP is more CPU-bound: the GPUs spend 2x more time idle waiting for the next batch than with Horovod (see the timing sketch after this list).
  2. DDP spends 10x more time at the beginning of each epoch setting things up (and setting persistent_workers=True doesn’t change this).
  3. DDP re-imports all my files at the beginning of each epoch, doing so n_gpus*n_cpu_workers times.
  4. DDP sometimes leads to ‘leaked semaphore objects’ errors.
  5. DDP complains that some parameters weren’t used in producing the loss, so I need to set ‘find_unused_parameters=True’. I get no such error with vanilla single-GPU code or with Horovod.
  6. During my debugging I also found that using the Horovod launching method on a single GPU is faster than vanilla PyTorch on a single GPU (no DDP code), which is very strange. In other words, my training epoch is twice as fast if I run
     mpirun -allow-run-as-root -np 1 -H localhost:1 -x MASTER_ADDR=127.0.0.1 -x MASTER_PORT=23457 -x HOROVOD_TIMELINE=/tmp/timeline.json -x OMP_NUM_THREADS=1 -x KMP_AFFINITY='granularity=fine,compact,1,0' -bind-to none python scripts/train_horovod.py
     than if I run
     python scripts/train_singlegpu.py
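
To make observation (1) concrete, the kind of per-batch timing I have in mind looks roughly like this (just a sketch; train_loader, model, optimizer and device are placeholders for the real objects):

```python
import time

import torch
import torch.nn.functional as F


def split_wait_vs_compute(train_loader, model, optimizer, device, num_batches=100):
    """Rough split of wall time into 'waiting for the dataloader' vs 'forward/backward'."""
    wait, compute = 0.0, 0.0
    it = iter(train_loader)
    for _ in range(num_batches):
        t0 = time.perf_counter()
        try:
            data, target = next(it)        # time spent blocked on the workers
        except StopIteration:
            break
        t1 = time.perf_counter()

        data, target = data.to(device), target.to(device)
        loss = F.cross_entropy(model(data), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if device.type == "cuda":
            torch.cuda.synchronize()       # count the GPU work in the compute bucket
        t2 = time.perf_counter()

        wait += t1 - t0
        compute += t2 - t1
    print(f"dataloader wait: {wait:.1f}s, compute: {compute:.1f}s")
```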

Setup:
My DDP code follows the official guidelines and I’ve used DDP successfully in the past. The main thing specific to my codebase is that my dataset class loads the nuscenes-devkit, which takes a lot of system memory. I share this object between the validation and training datasets using a global variable.
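
Roughly, the pattern looks like this (a simplified sketch; the paths and class names are placeholders, not my actual code):

```python
# dataset.py (sketch)
import torch
from nuscenes.nuscenes import NuScenes  # nuscenes-devkit

_NUSC = None  # module-level "global" shared by the train and val datasets in one process


def get_nuscenes(version="v1.0-trainval", dataroot="/data/nuscenes"):
    """Load the memory-heavy devkit object once per process and reuse it."""
    global _NUSC
    if _NUSC is None:
        _NUSC = NuScenes(version=version, dataroot=dataroot, verbose=False)
    return _NUSC


class NuScenesSplit(torch.utils.data.Dataset):
    def __init__(self, split):
        self.nusc = get_nuscenes()  # train and val instances reuse the same object
        self.split = split

    def __len__(self):
        return len(self.nusc.sample)

    def __getitem__(self, idx):
        return self.nusc.sample[idx]  # placeholder for the real loading logic
```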

More specifically, this parameter is the cause of mpirun being faster than vanilla PyTorch for my code:

It would help to file a GitHub issue and include a reproducible code snippet so we can look into these issues.

In lieu of that, just off the top of my head:

(1) seems a bit surprising, but it may depend on the model. DDP shouldn’t really be doing anything during the model forward, but it does schedule allreduce operations during backward. If you’re profiling, you might be able to shed light on what CPU code is running.
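
For example, something along these lines with torch.profiler would show which CPU ops dominate (a rough sketch; train_loader and train_step stand in for your own loop):

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=5),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./profiler_logs"),
) as prof:
    for step, (data, target) in enumerate(train_loader):
        train_step(model, data, target)  # your existing forward/backward/optimizer step
        prof.step()
        if step >= 8:
            break

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
```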

(2) and (3) are also surprising: DDP itself is not really a trainer or a framework. It shouldn’t have control over the beginning of an epoch, and it definitely shouldn’t be importing your files. All DDP does is wrap your model with something that schedules allreduce ops during backward.
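
Concretely, the whole DDP-specific part of a training script is usually just something like this (a minimal sketch, assuming a torchrun-style launcher that sets RANK/LOCAL_RANK/WORLD_SIZE):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_model_for_ddp(model):
    """DDP only wraps the module: it has no notion of epochs and never imports your files."""
    dist.init_process_group(backend="nccl")        # reads rank/world size from the launcher's env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    # The wrapper registers hooks that allreduce gradients during backward(); that's essentially it.
    return DDP(model, device_ids=[local_rank])
```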
(4) would be good to file a bug report for if you can reproduce it reliably.
(5) is legitimate: the way DDP works, it needs to know which parameters to wait on for gradients. This can either be accomplished by using all the parameters (faster) or by enabling the mode that detects unused parameters (a bit slower). Note there is also a ‘static_graph=True’ option that may be worth trying if you can guarantee your graph is indeed static.
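
In code, the two variants look like this (assuming `model` and `local_rank` are set up as in the sketch above; static_graph needs a reasonably recent PyTorch):

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# Slower but tolerant: DDP traverses the autograd graph each iteration to find unused parameters.
ddp_model = DDP(model.cuda(local_rank), device_ids=[local_rank], find_unused_parameters=True)

# Faster, if you can guarantee the same graph (and the same used/unused parameters) every iteration:
ddp_model = DDP(model.cuda(local_rank), device_ids=[local_rank], static_graph=True)
```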

(6) is again surprising. I wonder if you’d find the same is true if you comment out the model training and just time the import of torch and your frameworks.
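
i.e. something as crude as a script that only does the imports, launched both ways (the module names below are just placeholders for whatever your codebase pulls in):

```python
import time

t0 = time.perf_counter()
import torch            # noqa: E402
import horovod.torch    # noqa: E402  -- or whatever frameworks/modules your script imports
t1 = time.perf_counter()

print(f"imports took {t1 - t0:.1f}s")
```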


Thanks for sharing your thoughts.

Let me reply for the record in case someone else is looking at this in the future.

When I said “DDP” was slow I wasn’t being very precise: I meant “DDP + DistributedSampler” is slow, and probably the latter is to blame in this context.

I have had the same issue described above in a new codebase, and can confirm that using DDP and DistributedSampler imports my dataset.py files n_workers*n_gpus times there as well. I’m no expert, but I reckon this is probably normal behaviour, and Horovod may spawn its workers differently and so doesn’t induce this behaviour.

The difference in my new repo is that setting persistent_workers=True does help for epoch>1, and so this large inefficiency of DDP/DistributedSampler is only a problem on the first epoch, when it takes 3 minutes to even start training.
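
For reference, the relevant setup is roughly the standard one below (a sketch, not my exact code). My understanding is that when worker processes are started with the spawn method, each one re-imports the top-level modules, which would explain the n_gpus*n_workers imports; persistent_workers=True just keeps those workers (and whatever they loaded) alive across epochs so the cost is only paid once:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def make_train_loader(train_dataset, batch_size, num_workers):
    sampler = DistributedSampler(train_dataset, shuffle=True)   # one shard per DDP rank
    loader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        sampler=sampler,
        num_workers=num_workers,
        pin_memory=True,
        persistent_workers=True,   # pay the worker start-up (and re-import) cost only once
    )
    return loader, sampler


# In the training loop, the sampler needs to know the epoch for proper shuffling:
# for epoch in range(num_epochs):
#     sampler.set_epoch(epoch)
#     for batch in loader:
#         ...
```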

People whose CPU can’t afford persistent_workers may be stuck at extremely low speeds, especially if the dataset is pretty small and you run many epochs.