Using DDP, my code runs 3x slower than with Horovod, for both single-GPU and multi-GPU use cases. I’m trying to understand whether this comes from an inherent design difference between the two frameworks, or whether I broke something in my DDP setup.
Observations:
- DDP is more CPU-bound: the GPUs spend 2x more time idle waiting for the next batch than with Horovod.
- DDP spends 10x more time at the beginning of each epoch setting things up (and setting persistent_workers=True doesn’t change this).
- DDP re-imports all my files at the beginning of each epoch, doing so n_gpus * n_cpu_workers times.
- DDP sometimes leads to ‘leaked semaphore objects’ errors.
- DDP complains that some parameters weren’t used in producing the loss, so I have to pass find_unused_parameters=True (see the setup sketch after this list). I get no such error with vanilla single-GPU code or with Horovod.
- During my debugging efforts I also found that using the Horovod launching method on a single GPU is faster than vanilla PyTorch on a single GPU (no DDP code at all). This is very strange. In other words, my training epoch is twice as fast if I run this:
mpirun -allow-run-as-root -np 1 -H localhost:1 -x MASTER_ADDR=127.0.0.1 -x MASTER_PORT=23457 -x HOROVOD_TIMELINE=/tmp/timeline.json -x OMP_NUM_THREADS=1 -x KMP_AFFINITY='granularity=fine,compact,1,0' -bind-to none python scripts/train_horovod.py
than if I run this:
python scripts/train_singlegpu.py
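For reference, the DDP pieces these observations refer to look roughly like this (a minimal sketch, not my exact code; the batch size, worker count, and function names are assumptions):

```python
# Minimal sketch of the DDP setup the observations above refer to (assumed names/values).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def setup_ddp(rank, world_size, model, train_dataset):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Without find_unused_parameters=True, DDP raises the "parameters that were
    # not used in producing the loss" error mentioned above.
    ddp_model = DDP(model.cuda(rank), device_ids=[rank], find_unused_parameters=True)

    sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(
        train_dataset,
        batch_size=8,             # assumed value
        sampler=sampler,
        num_workers=4,            # assumed value
        pin_memory=True,
        persistent_workers=True,  # does not remove the per-epoch setup overhead I observe
    )
    return ddp_model, loader
```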
Setup:
My DDP code follows the official guidelines, and I’ve used DDP successfully in the past. The main peculiarity of my codebase is that my dataset class loads the nuscenes-devkit, which takes a lot of system memory. I share this object between the training and validation datasets through a global variable.
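The sharing pattern looks roughly like this (a minimal sketch; the module layout, dataset class name, and NuScenes arguments here are placeholders rather than my exact code):

```python
# dataset.py -- sketch of sharing the heavy nuscenes-devkit object via a module-level global
from nuscenes.nuscenes import NuScenes
from torch.utils.data import Dataset

_NUSC = None  # global holding the memory-hungry NuScenes instance


def get_nusc(version="v1.0-trainval", dataroot="/data/nuscenes"):
    # Load the devkit once and reuse it for both the training and validation datasets.
    global _NUSC
    if _NUSC is None:
        _NUSC = NuScenes(version=version, dataroot=dataroot, verbose=False)
    return _NUSC


class NuscenesSplitDataset(Dataset):
    def __init__(self, split):
        self.nusc = get_nusc()  # 'train' and 'val' datasets reuse the same object
        self.split = split
    # __len__ / __getitem__ omitted
```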