Inference of two models isn't parallelized on the same GPU

Hello,

I have a model that combines the inference of two models, as follows:

class model_all(nn.Module):
    def __init__(self, dense, shuffle):
        super(model_all, self).__init__()
        self.dense = dense
        self.shuffle = shuffle

    def forward(self, input):
        out_dense = self.dense(input)
        out_shuffle = self.shuffle(input)
        return out_dense, out_shuffle

I am expecting the inference of the two models to happen in parallel, since CUDA runs asynchronously. In other words,

model_all_inference_time < (dense_inference_time + shuffle_inference_time).

However, what I get is

model_all_inference_time ~= (dense_inference_time + shuffle_inference_time)

When I measured the inference time of model_all using the following block of code, the output was 0.0228 s.

dummy = torch.ones((1, 3, 256, 512)).to(device)
t = []
for _ in range(1000):
    tic = time.perf_counter()
    model(dummy)
    torch.cuda.synchronize()
    toc = time.perf_counter()
    t.append(toc-tic)
print(sum(t)/1000)

Then I measured the inference time of self.shuffle by commenting out self.dense(input) in the forward function; the output was 0.0069. Next, I measured the inference time of self.dense by commenting out self.shuffle(input); the output was 0.0161. Since 0.0161 + 0.0069 ~= 0.0228, the inference of the two models was not parallel but sequential.
Also, GPU utilization doesn't exceed 50% during model_all inference, but it reaches 65% with self.dense inference alone.

Is there a way to run the two sub-models in parallel?

I am using PyTorch 1.1.0, CUDA 9, and an NVIDIA TITAN Xp.
You can use this minimal code to reproduce the issue. Thank you!

import torch
import torch.nn as nn
import time
import torchvision
device = torch.device('cuda')

class model_all(nn.Module):
    def __init__(self, dense, shuffle):
        super(model_all, self).__init__()
        self.dense = dense
        self.shuffle = shuffle

    def forward(self, input):
        out_dense = self.dense(input)
        out_shuffle = self.shuffle(input)
        return out_dense, out_shuffle

dense = torchvision.models.densenet121(pretrained=False).to(device)
shuffle = torchvision.models.shufflenet_v2_x0_5(pretrained=False).to(device)
model = model_all(dense, shuffle)
dummy = torch.ones((1, 3, 256, 512)).to(device)
t = []
for _ in range(1000):
    tic = time.perf_counter()
    model(dummy)
    torch.cuda.synchronize()
    toc = time.perf_counter()
    t.append(toc-tic)
print(sum(t)/1000)

If one sub-model already uses all SMs on your GPU, you won't be able to run another model in parallel with it. Many layers launch their kernels with a configuration chosen to saturate the GPU as much as possible.
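
To check how busy a single sub-model already keeps the GPU, one option (a rough sketch; the profiler table arguments may differ slightly between PyTorch versions) is to look at per-operator CUDA times with the autograd profiler:

import torch
import torchvision

device = torch.device('cuda')
dense = torchvision.models.densenet121(pretrained=False).to(device)
dummy = torch.ones((1, 3, 256, 512), device=device)

# Profile CUDA time per operator for one forward pass of a single sub-model.
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    dense(dummy)
    torch.cuda.synchronize()
print(prof.key_averages().table(sort_by='cuda_time_total'))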

You’re not using torch.cuda.Stream, which means all CUDA operations go into a single, serialized queue (the default stream). So even if your first model's kernel launches are non-blocking (memory copies are blocking), CUDA's asynchronous execution by itself only lets you enqueue the second model's ops earlier; it does not run them concurrently.
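
If the GPU does have spare capacity, you could try launching each sub-model on its own CUDA stream. Below is a minimal sketch (the model_all_streams wrapper is just illustrative); even with separate streams, actual overlap only happens if one model's kernels leave SMs free for the other:

import torch
import torch.nn as nn

class model_all_streams(nn.Module):
    def __init__(self, dense, shuffle):
        super(model_all_streams, self).__init__()
        self.dense = dense
        self.shuffle = shuffle
        # One side stream per sub-model, created once.
        self.stream1 = torch.cuda.Stream()
        self.stream2 = torch.cuda.Stream()

    def forward(self, input):
        # Make both side streams wait for work already queued on the default stream
        # (e.g. the host-to-device copy of the input).
        self.stream1.wait_stream(torch.cuda.current_stream())
        self.stream2.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(self.stream1):
            out_dense = self.dense(input)
        with torch.cuda.stream(self.stream2):
            out_shuffle = self.shuffle(input)
        # Let the default stream wait for both side streams before the outputs are used.
        torch.cuda.current_stream().wait_stream(self.stream1)
        torch.cuda.current_stream().wait_stream(self.stream2)
        return out_dense, out_shuffle

You could then time model_all_streams(dense, shuffle) with the same loop as above. With batch size 1, densenet121 alone may already occupy most of the SMs, so don't be surprised if the measured speedup is small.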