I am using multiple GPUs to train several models individually (e.g. a hyperparameter grid search on a model). I am using PyTorch's DataLoader with num_workers=4 on a 4 GB dataset that is read entirely into RAM, and every batch is sent to the GPU with `to(device)` in the usual way (i.e. inside the training loop, after reading the batch). I've noticed the following:
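To make the setup concrete, the loading-and-transfer pattern looks roughly like this (a simplified sketch with a dummy stand-in dataset; my real models and data go where indicated, and I use num_workers=4 in the actual runs):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for my real ~4 GB in-RAM dataset
features = torch.randn(1000, 32)
labels = torch.randint(0, 2, (1000,))
dataset = TensorDataset(features, labels)

# num_workers=4 in my actual runs; 0 here so the snippet runs anywhere
loader = DataLoader(dataset, batch_size=64, num_workers=0)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

for x, y in loader:
    # each batch is moved to the GPU inside the loop
    x = x.to(device)
    y = y.to(device)
    # ... forward / backward / optimizer step here ...
```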
- Time per epoch increases by ~30-40% when running two experiments concurrently on two separate GPUs, compared to running a single experiment on a single GPU. Similarly, going from two to three GPUs increases time per epoch by another ~30% (across all GPUs). I'm certain the models are executed on the GPU, because each GPU shows normal and expected CUDA activity (consistently 50-90%, depending on the model variant).
- CPU usage with a single GPU is ~60-70% (6-core/12-thread CPU); as soon as I fire up the second GPU, it never drops below 100%.
My conclusion is that the CPU acts as a bottleneck when training on multiple GPUs. Before splashing money on a better CPU, I'd like to verify that the CPU is indeed the bottleneck and that there are no further steps I can take to alleviate the issue.
What I have tried so far:
- Using both fewer and more num_workers in the DataLoader; this has little effect, other than slowing training down further when num_workers drops below 4.
- Setting os.environ['OMP_NUM_THREADS'] to 1 or 2 to 'force' the use of fewer threads (see: here, or here). Does not help.
- Preloading the entire dataset into GPU memory: not feasible, as the dataset is too large to fit alongside all variants of my models.
- Setting pin_memory=False or True in the DataLoader; this makes no noticeable difference.
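For the thread-limiting attempt, what I did looks roughly like this. One caveat I'm aware of: the environment variables only take effect if they are set before torch is imported, which is why I also tried torch.set_num_threads as an after-import alternative:

```python
import os

# Must be set BEFORE torch is imported to have any effect
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import torch

# Alternative that also works after import: cap PyTorch's intra-op CPU threads
torch.set_num_threads(1)
print(torch.get_num_threads())  # → 1
```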
My questions:
- Is this normal behavior? What could be the reason for such a performance drop?
- What further steps could I take to fix this, or to confirm that the CPU is indeed the bottleneck?
- Maybe this is related to this bug, although that seems solved.
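To make the second question concrete, one diagnostic I could run myself is to time how long an epoch spends waiting on the DataLoader versus computing (sketch with a dummy dataset; my real forward/backward pass would go in the marked spot):

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for my in-RAM dataset
dataset = TensorDataset(torch.randn(1000, 32), torch.randint(0, 2, (1000,)))
loader = DataLoader(dataset, batch_size=64, num_workers=0)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

load_time = 0.0
t0 = time.perf_counter()
for x, y in loader:
    # time spent blocked on the DataLoader for this batch
    load_time += time.perf_counter() - t0
    x, y = x.to(device), y.to(device)
    # ... forward / backward pass here ...
    if device.type == "cuda":
        torch.cuda.synchronize()  # make GPU timing meaningful
    t0 = time.perf_counter()

print(f"time spent waiting on DataLoader: {load_time:.3f}s")
```

If the waiting time grows when a second experiment is started on another GPU, that would point at the CPU-side data pipeline. I understand `python -m torch.utils.bottleneck` also exists for a more thorough profile.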
Setup:
- Windows 10 Pro
- Python 3.7
- PyTorch 1.3.1, installed via conda
- CUDA Toolkit 10.1
- cuDNN 7.0
- 64 GB RAM
- 2x 1080Ti, 1x 2080Ti
- I am using Spyder 3.3.6 with IPython, using three separate kernels to run the experiments on each GPU. Hence, I'm assuming the processes run independently.