Hi!
I am working on running deep model computations in parallel to get a faster model at inference time (i.e. in `.eval()` mode). I tried this tutorial:
https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
I ran the code presented in the tutorial and obtained faster models through pipelining inputs. However, the tutorial's code is written for training. I modified it so that no training is done. With these changes, the "Model Parallel" and "Pipeline Parallel" versions started running in similar times and I could no longer observe a speed-up.
As far as I understand, we do not need to make our code multi-threaded to achieve concurrency, because CUDA kernels are launched asynchronously. So two consecutive lines calling different models should be able to run in parallel without additional modifications. To test this idea I tried the following 3 cases, with every model being a plain ResNet50 from `torchvision.models`:
- Base Model: (1 GPU, 1 model)

```python
model_1.eval()
model_1.to(GPU_1)
for i, (images, targets) in enumerate(data_loader):
    images = list(image for image in images)
    images_1 = torch.stack(images).to(GPU_1, non_blocking=True)
    outputs_1 = model_1(images_1)
```
- Parallel Model: (2 GPUs, 2 models)

```python
model_1.eval()
model_2.eval()
model_1.to(GPU_1)
model_2.to(GPU_2)
for i, (images, targets) in enumerate(data_loader):
    images = list(image for image in images)
    images_1 = torch.stack(images).to(GPU_1, non_blocking=True)
    images_2 = torch.stack(images).to(GPU_2, non_blocking=True)
    outputs_1 = model_1(images_1)
    outputs_2 = model_2(images_2)
```
- Non-parallel Model: (1 GPU, 2 models)

```python
model_1.eval()
model_2.eval()
model_1.to(GPU_1)
model_2.to(GPU_1)
for i, (images, targets) in enumerate(data_loader):
    images = list(image for image in images)
    images_1 = torch.stack(images).to(GPU_1, non_blocking=True)
    images_2 = torch.stack(images).to(GPU_1, non_blocking=True)
    outputs_1 = model_1(images_1)
    outputs_2 = model_2(images_2)
```
My expectation is that cases 1 & 2 should have similar run-times, while case 3 should take approximately twice as long. In my experiments, when the batch size was greater than 2 this held: cases 1 & 2 had similar run-times and case 3 took ~1.7x longer. However, when I reduced the batch size to 1, this was no longer the case: now cases 2 & 3 had similar run-times, while case 1 was ~1.7x faster.
I couldn't figure out why this is happening. I tried changing `num_workers` of the `DataLoader`, but the results did not change much, apart from a minor speed-up of about 10% for case 2.
Any help is appreciated.
Thanks in advance.