I am using multiple GPUs to train several models individually (e.g. a hyperparameter grid search on a model). I am using PyTorch's DataLoader with num_workers=4 on a 4 GB dataset that is read entirely into RAM, and every batch is sent to the GPU with `to(device)` in the usual way (i.e. inside the training loop, after reading the batch). I've noticed the following:
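To make the setup concrete, the loading-and-transfer pattern looks roughly like this (a simplified sketch with a dummy stand-in dataset; my real models and data go where indicated, and I use num_workers=4 in the actual runs):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for my real ~4 GB in-RAM dataset
features = torch.randn(1000, 32)
labels = torch.randint(0, 2, (1000,))
dataset = TensorDataset(features, labels)

# num_workers=4 in my actual runs; 0 here so the snippet runs anywhere
loader = DataLoader(dataset, batch_size=64, num_workers=0)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

for x, y in loader:
    # each batch is moved to the GPU inside the loop
    x = x.to(device)
    y = y.to(device)
    # ... forward / backward / optimizer step here ...
```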
- Time per epoch increases by ~30-40% when running two experiments concurrently on two separate GPUs, compared to running a single experiment on a single GPU. Similarly, going from two to three GPUs increases time per epoch by another ~30% (across all GPUs). I'm certain the models are executed on the GPU, because each GPU shows normal and expected CUDA activity (consistently 50-90%, depending on the model variant).
- CPU usage with a single GPU is ~60-70% (6-core/12-thread CPU); as soon as I fire up the second GPU, it never drops below 100%.
My conclusion is that the CPU acts as a bottleneck when training on multiple GPUs. Before splashing money on a better CPU, I'd like to verify that the CPU is indeed the bottleneck and that there are no further steps I can take to alleviate the issue.
What I have tried so far:
- Using both fewer and more num_workers in the DataLoader; this has little effect, other than slowing training down further when num_workers drops below 4.
- Setting os.environ['OMP_NUM_THREADS'] to 1 or 2 to 'force' the use of fewer threads (see: here, or here). Does not help.
- Preloading the entire dataset into GPU memory: not feasible, as the dataset is too large to fit alongside all variants of my models.
- Setting pin_memory=False or True in the DataLoader; this makes no noticeable difference.
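For the thread-limiting attempt, what I did looks roughly like this. One caveat I'm aware of: the environment variables only take effect if they are set before torch is imported, which is why I also tried torch.set_num_threads as an after-import alternative:

```python
import os

# Must be set BEFORE torch is imported to have any effect
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import torch

# Alternative that also works after import: cap PyTorch's intra-op CPU threads
torch.set_num_threads(1)
print(torch.get_num_threads())  # → 1
```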
My questions:
- Is this normal behavior? What could be the reason for such a performance drop?
- What further steps could I take to fix this, or to confirm that the CPU is indeed the bottleneck?
- Maybe this is related to this bug, although that seems solved.
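To make the second question concrete, one diagnostic I could run myself is to time how long an epoch spends waiting on the DataLoader versus computing (sketch with a dummy dataset; my real forward/backward pass would go in the marked spot):

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for my in-RAM dataset
dataset = TensorDataset(torch.randn(1000, 32), torch.randint(0, 2, (1000,)))
loader = DataLoader(dataset, batch_size=64, num_workers=0)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

load_time = 0.0
t0 = time.perf_counter()
for x, y in loader:
    # time spent blocked on the DataLoader for this batch
    load_time += time.perf_counter() - t0
    x, y = x.to(device), y.to(device)
    # ... forward / backward pass here ...
    if device.type == "cuda":
        torch.cuda.synchronize()  # make GPU timing meaningful
    t0 = time.perf_counter()

print(f"time spent waiting on DataLoader: {load_time:.3f}s")
```

If the waiting time grows when a second experiment is started on another GPU, that would point at the CPU-side data pipeline. I understand `python -m torch.utils.bottleneck` also exists for a more thorough profile.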
Setup:
- Windows 10 Pro
- Python 3.7
- PyTorch 1.3.1, installed via conda
- CUDA Toolkit 10.1
- cuDNN 7.0
- 64 GB RAM
- 2x 1080Ti, 1x 2080Ti
- I am using Spyder 3.3.6 with IPython, using three separate kernels to run the experiments on each GPU. Hence, I'm assuming the processes run independently.