Multiprocessing for multiple GPUs

I was wondering why it is not advised to use multiple GPUs via multiprocessing. For example, towards the end of http://pytorch.org/docs/master/notes/cuda.html the advice is "Use nn.DataParallel instead of multiprocessing",
while there is a tutorial on using multiple GPUs with multiprocessing: http://pytorch.org/tutorials/intermediate/dist_tuto.html .
Which is the correct guideline? The two seem contradictory.


I have the same question.

I can’t speak to the specifics of the guidelines here. In my own usage, DataParallel is the quick and easy way to get going with multiple GPUs on a single machine.
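For reference, a minimal sketch of that single-process DataParallel setup (the model and tensor shapes here are just placeholders, not anything from the examples above):

```python
import torch
import torch.nn as nn

# Placeholder model; substitute your own network.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

if torch.cuda.device_count() > 1:
    # One process drives all visible GPUs: the input batch is split along
    # dim 0, replicas run the forward pass in parallel, and gradients are
    # gathered back onto the default device.
    model = nn.DataParallel(model)

model = model.cuda()
out = model(torch.randn(64, 512).cuda())  # batch is scattered across the GPUs
```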

However, if you want to push the performance, I've found that using the NVIDIA apex implementation of DistributedDataParallel with one GPU per process, plus a few of their other optimizations, better saturates the GPUs on a single machine and usually results in roughly 10-15% higher throughput.

I can't speak to how apex DDP compares to the PyTorch native implementation; I switched to apex because I also use AMP mixed precision from time to time.

Their example is a good sample: https://github.com/NVIDIA/apex/blob/master/examples/imagenet/main_amp.py
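For comparison, here is a minimal sketch of the one-GPU-per-process pattern using PyTorch's native DistributedDataParallel rather than apex (the model, batch, and script name are placeholders, and a real training script would shard the dataset with a DistributedSampler-backed DataLoader):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK (plus RANK/WORLD_SIZE) for each spawned
    # process; each process owns exactly one GPU.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    model = nn.Linear(512, 10).cuda()          # placeholder model
    model = DDP(model, device_ids=[local_rank])

    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x = torch.randn(64, 512).cuda()            # stand-in for a real DataLoader batch
    loss = model(x).sum()
    loss.backward()                            # gradients are all-reduced across processes
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=3 train.py`, each of the three processes gets its own GPU, its own data workers, and its own slice of the dataset.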


For curiosity's sake, I ran a quick test on a machine that I recently bumped up to 3 Pascal GPUs. The previous comparison was made with 2 RTX cards.

For ImageNet-style training at 224x224, with a smaller model (something like MNASNet/MobileNetV2) and an 8-physical-core CPU:
830 img/sec avg - single training process, 3 GPUs, torch.nn.DataParallel, 8 (or 9, for fairness) worker processes
1015 img/sec avg - 3 training processes, 1 GPU per process, apex.DistributedDataParallel, 3 workers per training process

Everything else in those two runs is the same: same preprocessing, same 'fast' preload + collation routines from NVIDIA's examples. So it looks like throwing in another GPU widens the gap to more than 20%.