Hello,
I have a model that combines the inference of two sub-models, as follows:
class model_all(nn.Module):
    def __init__(self, dense, shuffle):
        super(model_all, self).__init__()
        self.dense = dense
        self.shuffle = shuffle

    def forward(self, input):
        # run both sub-models on the same input and return both outputs
        out1 = self.dense(input)
        out2 = self.shuffle(input)
        return out1, out2
I expected the inference of the two models to run in parallel, since CUDA executes asynchronously. In other words:

model_all_inference_time < dense_inference_time + shuffle_inference_time

However, what I actually get is:

model_all_inference_time ≈ dense_inference_time + shuffle_inference_time
When I measured the inference time of model_all using the following block of code, the printed average was 0.0228 s:
dummy = torch.ones((1, 3, 256, 512)).to(device)
t = []
for _ in range(1000):
    tic = time.perf_counter()
    model(dummy)
    torch.cuda.synchronize()  # block until all queued GPU work finishes
    toc = time.perf_counter()
    t.append(toc - tic)
print(sum(t) / 1000)  # average time per forward pass, in seconds
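In case the measurement method matters: an alternative would be to time with CUDA events plus a warm-up phase. Here is a sketch of that variant (for reference only; the 0.0228 s figure above comes from the perf_counter loop, not from this):

starter = torch.cuda.Event(enable_timing=True)
ender = torch.cuda.Event(enable_timing=True)

for _ in range(50):  # warm-up so cuDNN/kernel setup cost is excluded
    model(dummy)
torch.cuda.synchronize()

t = []
for _ in range(1000):
    starter.record()
    model(dummy)
    ender.record()
    torch.cuda.synchronize()  # wait for the GPU before reading the events
    t.append(starter.elapsed_time(ender) / 1000.0)  # elapsed_time is in ms
print(sum(t) / len(t))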
Then I measured the self.shuffle inference time on its own by commenting out self.dense(input) in the forward function; the average was 0.0069 s.
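For clarity, the shuffle-only measurement used a forward like this (the dense-only measurement is the mirror image):

def forward(self, input):
    # out1 = self.dense(input)  # commented out for the shuffle-only timing
    out2 = self.shuffle(input)
    return out2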
Next, I measured the self.dense inference time by commenting out self.shuffle(input) instead; the average was 0.0161 s. Since 0.0161 + 0.0069 ≈ 0.0228, the two models clearly run sequentially rather than in parallel.
Also, GPU utilization doesn't exceed 50% during model_all inference, while it reaches 65% with self.dense alone.
Is there a way to run the two sub-models in parallel? For example, I imagine something along these lines with torch.cuda.Stream (just a sketch; I am not sure the stream handling below is correct, or that it would actually overlap execution):
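class model_all_streams(nn.Module):
    def __init__(self, dense, shuffle):
        super(model_all_streams, self).__init__()
        self.dense = dense
        self.shuffle = shuffle
        self.s1 = torch.cuda.Stream()
        self.s2 = torch.cuda.Stream()

    def forward(self, input):
        cur = torch.cuda.current_stream()
        # make sure the input is ready before the side streams consume it
        self.s1.wait_stream(cur)
        self.s2.wait_stream(cur)
        with torch.cuda.stream(self.s1):  # launch dense on its own stream
            out1 = self.dense(input)
        with torch.cuda.stream(self.s2):  # launch shuffle on another stream
            out2 = self.shuffle(input)
        # re-join: don't touch the outputs until both streams are done
        cur.wait_stream(self.s1)
        cur.wait_stream(self.s2)
        return out1, out2

Would that actually overlap the two forward passes, or is there a better approach?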
I am using PyTorch 1.1.0, CUDA 9, and an NVIDIA TITAN Xp.
You can use this minimal code to reproduce the issue. Thank you!
import time

import torch
import torch.nn as nn
import torchvision

device = torch.device('cuda')

class model_all(nn.Module):
    def __init__(self, dense, shuffle):
        super(model_all, self).__init__()
        self.dense = dense
        self.shuffle = shuffle

    def forward(self, input):
        # run both sub-models on the same input and return both outputs
        out1 = self.dense(input)
        out2 = self.shuffle(input)
        return out1, out2

dense = torchvision.models.densenet121(pretrained=False).to(device)
shuffle = torchvision.models.shufflenet_v2_x0_5(pretrained=False).to(device)
model = model_all(dense, shuffle)

dummy = torch.ones((1, 3, 256, 512)).to(device)
t = []
for _ in range(1000):
    tic = time.perf_counter()
    model(dummy)
    torch.cuda.synchronize()  # block until all queued GPU work finishes
    toc = time.perf_counter()
    t.append(toc - tic)
print(sum(t) / 1000)  # average time per forward pass, in seconds