Time per epoch increases when running experiments on multiple GPUs concurrently

I am using multiple GPUs to train several models independently (e.g. a hyperparameter grid search over a model). I am using PyTorch's DataLoader with num_workers=4 on a 4 GB dataset that is read entirely into RAM, and every batch is sent to the GPU with .to(device) in the usual way (i.e. inside the training loop, right after the batch is read).
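
Roughly, the per-experiment setup looks like the sketch below (placeholder data, model and hyperparameters; the real 4 GB dataset and models are specific to my experiments, and each kernel targets a different GPU index):

    import torch
    from torch.utils.data import TensorDataset, DataLoader

    device = torch.device("cuda:0")  # each experiment/kernel uses a different index

    # Stand-ins for the real in-RAM dataset and model
    features = torch.randn(100_000, 64)
    targets = torch.randint(0, 10, (100_000,))
    loader = DataLoader(TensorDataset(features, targets),
                        batch_size=256, shuffle=True, num_workers=4)

    model = torch.nn.Linear(64, 10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = torch.nn.CrossEntropyLoss()

    # (When run as a plain script on Windows, this loop should sit under an
    #  if __name__ == "__main__": guard because of the spawned worker processes.)
    for x, y in loader:
        x, y = x.to(device), y.to(device)  # batch moved to the GPU inside the loop
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

I've noticed the following happening: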

  • Time per epoch increases by ~30-40% when two experiments run concurrently on two separate GPUs instead of a single experiment on a single GPU. Similarly, going from two to three GPUs increases time per epoch by another ~30% (across all GPUs). I'm certain the models are executed on the GPU, because each GPU shows the normal, expected CUDA activity (consistently between 50-90%, depending on the model variant).
  • CPU usage with a single GPU is ~60-70% (6-core/12-thread CPU); as soon as the second GPU is running, it never drops below 100%.

My conclusion is that the CPU appears to be the bottleneck when training on multiple GPUs. Before spending money on a better CPU, I'd like to make sure that the CPU really is the bottleneck and that there are no further steps I can take to alleviate the issue.

What I have tried so far:

  • Using both fewer and more num_workers in the DataLoader; this has little effect, other than slowing training down even further when num_workers drops below 4.
  • Setting os.environ['OMP_NUM_THREADS'] to 1 or 2 to 'force' the use of fewer threads (see: here, or here); this does not help. A rough sketch of this attempt follows this list.
  • Preloading the entire dataset into GPU memory: not feasible, as the dataset is too large to fit alongside all variants of my models.
  • Setting pin_memory to False or True in the DataLoader; it makes no real difference.
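
For reference, the thread-limiting attempt looked roughly like this (a minimal sketch; the values "1" and "2" were both tried):

    import os

    # Has to be set before torch is imported, otherwise the OpenMP runtime
    # may ignore it; neither value changed the slowdown.
    os.environ["OMP_NUM_THREADS"] = "1"

    import torch  # imported only after the variable is set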

My questions:

  • Is this normal behavior? What could be the reason for such a performance drop?
  • What further steps could I take to fix this, or to confirm that the CPU really is the bottleneck? (A timing sketch of what I had in mind follows this list.)
  • Maybe this is related to this bug, although that one seems to have been solved.
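
One thing I had in mind to check where the time goes is timing the DataLoader separately from the GPU work, roughly as in the sketch below (loader, model, criterion, optimizer and device stand for the objects from my training script):

    import time
    import torch

    def time_epoch(loader, model, criterion, optimizer, device):
        """Split one epoch's wall time into batch-fetch time and compute time."""
        fetch_time, compute_time = 0.0, 0.0
        t0 = time.perf_counter()
        for x, y in loader:
            t1 = time.perf_counter()
            fetch_time += t1 - t0               # time spent waiting on the DataLoader
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            torch.cuda.synchronize(device)      # make sure the GPU work is counted here
            t0 = time.perf_counter()
            compute_time += t0 - t1
        print(f"batch fetching: {fetch_time:.1f} s, training steps: {compute_time:.1f} s")

If it is mainly fetch_time that grows once the second experiment starts, that would point to the CPU-side data loading as the bottleneck. Would that be a sound way to check?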

Setup:

  • Windows 10 Pro
  • Python 3.7
  • PyTorch 1.3.1 (installed via conda)
  • CUDA Toolkit 10.1
  • cuDNN 7.0
  • 64 GB RAM
  • 2x 1080Ti, 1x 2080Ti
  • I am running the experiments from Spyder 3.3.6 with IPython, using three separate kernels, one per GPU. Hence, I'm assuming the processes run independently of each other.