Parallel Processing with 2 GPUs

Hi!

I am working on running deep model computations in parallel to get a faster model that is supposed to run fast in .eval() mode. I tried this tutorial:

https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html

I ran the code presented in the tutorial and was able to obtain faster models by pipelining the inputs. However, the tutorial’s code is written for training, so I modified it so that no training is done. With these changes, the “Model Parallel” and “Pipeline Parallel” versions started running in similar times and I could no longer observe a speed-up.

As far as I understand, we do not need to make the code multi-threaded to achieve concurrency, because CUDA kernels are launched asynchronously. So two consecutive lines calling models on different GPUs should be able to run in parallel without additional modifications. To test this idea I tried the following 3 cases, with every model being a plain ResNet50 from torchvision.models:

  1. Base Model (1 GPU, 1 model):
model_1.eval()
model_1.to(GPU_1)
for i, (images, targets) in enumerate(data_loader):
    images = list(image for image in images)
    images_1 = torch.stack(images).to(GPU_1, non_blocking=True)
    outputs_1 = model_1(images_1)
  2. Parallel Model (2 GPUs, 2 models):
model_1.eval()
model_2.eval()
model_1.to(GPU_1)
model_2.to(GPU_2)
for i, (images, targets) in enumerate(data_loader):
    images = list(image for image in images)
    images_1 = torch.stack(images).to(GPU_1, non_blocking=True)
    images_2 = torch.stack(images).to(GPU_2, non_blocking=True)
    outputs_1 = model_1(images_1)
    outputs_2 = model_2(images_2)
  3. Non-parallel Model (1 GPU, 2 models):
model_1.eval()
model_2.eval()
model_1.to(GPU_1)
model_2.to(GPU_1)
for i, (images, targets) in enumerate(data_loader):
    images = list(image for image in images)
    images_1 = torch.stack(images).to(GPU_1, non_blocking=True)
    images_2 = torch.stack(images).to(GPU_1, non_blocking=True)
    outputs_1 = model_1(images_1)
    outputs_2 = model_2(images_2)

My expectation was that cases 1 & 2 should have similar run-times, while case 3 should take roughly twice as long. In my experiments this held as long as the batch size was greater than 2: cases 1 & 2 had similar run-times and case 3 took ~1.7x longer. However, when I reduced the batch size to 1, this was no longer the case: cases 2 & 3 now had similar run-times, while case 1 was ~1.7x faster.
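For reference, a minimal sketch of how the run-time of case 2 can be timed (the helper name is just illustrative): since the kernels are launched asynchronously, `torch.cuda.synchronize()` is needed on both devices before stopping the clock, otherwise only the launch time would be measured.

```python
import time
import torch

def time_case_2(model_1, model_2, data_loader, GPU_1, GPU_2):
    # make sure no earlier work is still running on either device
    torch.cuda.synchronize(GPU_1)
    torch.cuda.synchronize(GPU_2)
    start = time.perf_counter()
    with torch.no_grad():
        for images, targets in data_loader:
            images = list(image for image in images)
            images_1 = torch.stack(images).to(GPU_1, non_blocking=True)
            images_2 = torch.stack(images).to(GPU_2, non_blocking=True)
            outputs_1 = model_1(images_1)
            outputs_2 = model_2(images_2)
    # wait for all queued kernels to finish before reading the clock
    torch.cuda.synchronize(GPU_1)
    torch.cuda.synchronize(GPU_2)
    return time.perf_counter() - start
```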

I couldn’t figure out why this is happening. I tried changing `num_workers` of the `DataLoader`, but the results did not change much, apart from a minor speed-up of about 10% for case 2.
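As a side note, a sketch of the kind of `DataLoader` configuration I have been varying is below (`dataset`, `batch_size`, and the exact `num_workers` value are placeholders). As far as I know, `.to(..., non_blocking=True)` is only truly asynchronous when the source tensor lives in pinned host memory, and `torch.stack(...)` allocates a new (pageable) tensor, so pinning in the loader may not carry over to the stacked batch.

```python
from torch.utils.data import DataLoader

data_loader = DataLoader(
    dataset,                # placeholder dataset
    batch_size=batch_size,  # placeholder batch size
    shuffle=False,
    num_workers=4,          # load the next batch while the GPUs are busy
    pin_memory=True,        # page-locked host memory for async H2D copies
)
```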

Any help is appreciated.

Thanks in advance.

The timing you are seeing depends on the relative CPU and GPU workload.
E.g. if the GPU workload is small (due to a small model or batch size), you might see the overhead of the kernel launches as well as of other CPU operations such as `list(...)`, `torch.stack(...)`, the data loading, etc.

To isolate the actual runtime of the models running in parallel, you could remove the data loading and profile the models in isolation. Afterwards you could profile the data loading alone, which would give you an idea of how fast the different parts of the code are.
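Something along these lines would time just the forward passes (a rough sketch, assuming two visible devices and random tensors in place of the real data):

```python
import time
import torch
import torchvision

GPU_1, GPU_2 = 'cuda:0', 'cuda:1'   # assumption: two visible devices
batch_size = 1                      # the problematic case

model_1 = torchvision.models.resnet50().eval().to(GPU_1)
model_2 = torchvision.models.resnet50().eval().to(GPU_2)

x_1 = torch.randn(batch_size, 3, 224, 224, device=GPU_1)
x_2 = torch.randn(batch_size, 3, 224, 224, device=GPU_2)

with torch.no_grad():
    # warm-up so CUDA context creation and cuDNN autotuning are excluded
    for _ in range(10):
        model_1(x_1)
        model_2(x_2)
    torch.cuda.synchronize(GPU_1)
    torch.cuda.synchronize(GPU_2)

    start = time.perf_counter()
    for _ in range(100):
        outputs_1 = model_1(x_1)
        outputs_2 = model_2(x_2)
    torch.cuda.synchronize(GPU_1)
    torch.cuda.synchronize(GPU_2)
    print('mean forward time: {:.3f} ms'.format(
        (time.perf_counter() - start) / 100 * 1e3))
```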

Based on your description, my best guess is that the GPU workload is too small for a batch size of 1, so the mentioned overheads become visible.
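To verify this, you could also run a few iterations under `torch.profiler` and compare the CPU time (kernel launches, Python overhead) with the actual GPU kernel time, e.g. reusing the random inputs from the previous sketch:

```python
import torch
from torch.profiler import profile, ProfilerActivity

with torch.no_grad(), profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        outputs_1 = model_1(x_1)
        outputs_2 = model_2(x_2)

# sort by GPU time to see whether the kernels or the launch overhead dominate
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```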