Inference of two models isn't parallelized on the same GPU

Hello,

I have a model that combines the inference of two models, as follows:

class model_all(nn.Module):
    def __init__(self, dense, shuffle):
        super(model_all, self).__init__()
        self.dense = dense
        self.shuffle = shuffle

    def forward(self, input):
        out_dense = self.dense(input)
        out_shuffle = self.shuffle(input)
        return out_dense, out_shuffle

I am expecting the inference of the two models to happen in parallel, since CUDA runs asynchronously. In other words,

model_all_inference_time < (dense_inference_time + shuffle_inference_time).

However, what I get is

model_all_inference_time ~= (dense_inference_time + shuffle_inference_time)

When I measured the inference time of model_all using the following block of code, the output was 0.0228 s.

dummy = torch.ones((1, 3, 256, 512)).to(device)
t = []
for _ in range(1000):
    tic = time.perf_counter()
    model(dummy)
    torch.cuda.synchronize()
    toc = time.perf_counter()
    t.append(toc-tic)
print(sum(t)/1000)

Then I measured the inference time of self.shuffle by commenting out self.dense(input) in the forward function; the output was 0.0069. Next, I measured the inference time of self.dense by commenting out self.shuffle(input); the output was 0.0161. Since 0.0161 + 0.0069 ~= 0.0228, the inference of the two models was not parallel but sequential.
Also, GPU utilization doesn't exceed 50% during model_all inference, but it reaches 65% with self.dense inference alone.

Is there a way to run the two sub-models in parallel?

I am using PyTorch 1.1.0, CUDA 9, and an NVIDIA TITAN Xp.
You can use this minimal code to reproduce the issue. Thank you!

import torch
import torch.nn as nn
import time
import torchvision
device = torch.device('cuda')

class model_all(nn.Module):
    def __init__(self, dense, shuffle):
        super(model_all, self).__init__()
        self.dense = dense
        self.shuffle = shuffle

    def forward(self, input):
        out_dense = self.dense(input)
        out_shuffle = self.shuffle(input)
        return out_dense, out_shuffle

dense = torchvision.models.densenet121(pretrained=False).to(device)
shuffle = torchvision.models.shufflenet_v2_x0_5(pretrained=False).to(device)
model = model_all(dense, shuffle)
dummy = torch.ones((1, 3, 256, 512)).to(device)
t = []
for _ in range(1000):
    tic = time.perf_counter()
    model(dummy)
    torch.cuda.synchronize()
    toc = time.perf_counter()
    t.append(toc-tic)
print(sum(t)/1000)

If one sub-model already uses all SMs on your GPU, you won't be able to run another model in parallel with it. Many layers launch their kernels with a configuration chosen to saturate the GPU as much as possible.
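
To check how busy a single sub-model already keeps the GPU, one option (a rough sketch; the profiler table arguments may differ slightly between PyTorch versions) is to look at per-operator CUDA times with the autograd profiler:

import torch
import torchvision

device = torch.device('cuda')
dense = torchvision.models.densenet121(pretrained=False).to(device)
dummy = torch.ones((1, 3, 256, 512), device=device)

# Profile CUDA time per operator for one forward pass of a single sub-model.
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    dense(dummy)
    torch.cuda.synchronize()
print(prof.key_averages().table(sort_by='cuda_time_total'))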

You’re not using torch.cuda.Stream, which means all CUDA operations go into a single, serialized queue (the default stream). So even if your first model's kernel launches are non-blocking (memory copies are blocking), CUDA's asynchronous execution by itself only lets you enqueue the second model's ops earlier; it does not run them concurrently.
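
If the GPU does have spare capacity, you could try launching each sub-model on its own CUDA stream. Below is a minimal sketch (the model_all_streams wrapper is just illustrative); even with separate streams, actual overlap only happens if one model's kernels leave SMs free for the other:

import torch
import torch.nn as nn

class model_all_streams(nn.Module):
    def __init__(self, dense, shuffle):
        super(model_all_streams, self).__init__()
        self.dense = dense
        self.shuffle = shuffle
        # One side stream per sub-model, created once.
        self.stream1 = torch.cuda.Stream()
        self.stream2 = torch.cuda.Stream()

    def forward(self, input):
        # Make both side streams wait for work already queued on the default stream
        # (e.g. the host-to-device copy of the input).
        self.stream1.wait_stream(torch.cuda.current_stream())
        self.stream2.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(self.stream1):
            out_dense = self.dense(input)
        with torch.cuda.stream(self.stream2):
            out_shuffle = self.shuffle(input)
        # Let the default stream wait for both side streams before the outputs are used.
        torch.cuda.current_stream().wait_stream(self.stream1)
        torch.cuda.current_stream().wait_stream(self.stream2)
        return out_dense, out_shuffle

You could then time model_all_streams(dense, shuffle) with the same loop as above. With batch size 1, densenet121 alone may already occupy most of the SMs, so don't be surprised if the measured speedup is small.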