Hello,

I have a model that combines the inference of two models, as follows:

```
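class model_all(nn.Module):
    def __init__(self, dense, shuffle):
        super(model_all, self).__init__()
        self.dense = dense
        self.shuffle = shuffle

    def forward(self, input):
        # Both sub-models receive the same input; outputs are discarded
        # because only the inference time is of interest here.
        self.dense(input)
        self.shuffle(input)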
```

I expected the two models to run in parallel, since CUDA kernels are launched asynchronously. In other words,

model_all_inference_time < (dense_inference_time + shuffle_inference_time).

However, what I get is

model_all_inference_time ~= (dense_inference_time + shuffle_inference_time)

When I measured the inference time of model_all with the following block of code, the output was 0.0228 (seconds per forward pass).

```
dummy = torch.ones((1, 3, 256, 512)).to(device)
t = []
for _ in range(1000):
    tic = time.perf_counter()
    model(dummy)
    torch.cuda.synchronize()
    toc = time.perf_counter()
    t.append(toc - tic)
print(sum(t) / 1000)
```
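The same measurement can be wrapped in a small helper that skips a few warm-up iterations, since the first calls on a GPU usually include one-off initialization costs. This is only a sketch: `time_forward`, `run_once`, `synchronize`, and `n_warmup` are hypothetical names, not part of PyTorch.

```python
import time


def time_forward(run_once, synchronize, n_iters=1000, n_warmup=10):
    """Return the mean seconds per call of run_once, excluding warm-up calls."""
    for _ in range(n_warmup):
        run_once()
    synchronize()  # drain any pending work before timing starts

    durations = []
    for _ in range(n_iters):
        tic = time.perf_counter()
        run_once()
        synchronize()  # wait for asynchronously launched work to finish
        durations.append(time.perf_counter() - tic)
    return sum(durations) / len(durations)
```

With the objects above, it would be called as `time_forward(lambda: model(dummy), torch.cuda.synchronize)`.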

Then I measured the inference time of `self.shuffle` by commenting out `self.dense(input)` in the forward function, and the output was 0.0069. Next, I measured the inference time of `self.dense` by commenting out `self.shuffle(input)`, and the output was 0.0161. So we can see that 0.0161 + 0.0069 = 0.0230 ~= 0.0228, which shows that the two models were run sequentially rather than in parallel.
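To put a number on how little overlap there is, the three measurements can be compared against the two extremes: fully sequential (the sum) and perfectly parallel (the slower of the two models). The "overlap fraction" below is just an illustrative metric and the variable names are made up for this sketch.

```python
# Measured mean times in seconds, taken from the runs described above.
dense_time = 0.0161
shuffle_time = 0.0069
combined_time = 0.0228

sequential_time = dense_time + shuffle_time      # no overlap at all
ideal_parallel_time = max(dense_time, shuffle_time)  # perfect overlap

# Fraction of the possible saving that was actually achieved (0 = none, 1 = all).
overlap = (sequential_time - combined_time) / (sequential_time - ideal_parallel_time)
print(f"sequential: {sequential_time:.4f} s, ideal parallel: {ideal_parallel_time:.4f} s")
print(f"overlap fraction: {overlap:.3f}")
```

The overlap fraction comes out at roughly 3%, so the combined run is almost exactly the sequential case.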

Also, GPU utilization doesn't exceed 50% during model_all inference, but it reaches 65% during self.dense inference.

Is there a way to run the two sub-models in parallel?

I am using PyTorch 1.1.0, CUDA 9, and an NVIDIA TITAN Xp.

You can use this minimal code to reproduce the issue. Thank you!

```
import time

import torch
import torch.nn as nn
import torchvision

device = torch.device('cuda')


class model_all(nn.Module):
    def __init__(self, dense, shuffle):
        super(model_all, self).__init__()
        self.dense = dense
        self.shuffle = shuffle

    def forward(self, input):
        # Both sub-models receive the same input; outputs are discarded.
        self.dense(input)
        self.shuffle(input)


dense = torchvision.models.densenet121(pretrained=False).to(device)
shuffle = torchvision.models.shufflenet_v2_x0_5(pretrained=False).to(device)
model = model_all(dense, shuffle)

dummy = torch.ones((1, 3, 256, 512)).to(device)
t = []
for _ in range(1000):
    tic = time.perf_counter()
    model(dummy)
    torch.cuda.synchronize()  # wait for the GPU before stopping the timer
    toc = time.perf_counter()
    t.append(toc - tic)
print(sum(t) / 1000)  # mean seconds per forward pass
```
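One direction that may be worth trying is to issue each sub-model on its own CUDA stream, since kernels launched on the same stream always run in order. The following is a minimal sketch, not a verified fix: `ModelAllStreams` is a hypothetical name, and whether the kernels actually overlap depends on how much of the GPU a single model already occupies, so it may give little or no speed-up.

```python
import torch
import torch.nn as nn


class ModelAllStreams(nn.Module):
    """Hypothetical variant of model_all that uses one CUDA stream per sub-model."""

    def __init__(self, dense, shuffle):
        super().__init__()
        self.dense = dense
        self.shuffle = shuffle

    def forward(self, x):
        if not x.is_cuda:
            # No CUDA streams on CPU: fall back to plain sequential calls.
            return self.dense(x), self.shuffle(x)

        current = torch.cuda.current_stream()
        s_dense, s_shuffle = torch.cuda.Stream(), torch.cuda.Stream()

        # The side streams must not start before the input is ready.
        s_dense.wait_stream(current)
        s_shuffle.wait_stream(current)

        with torch.cuda.stream(s_dense):
            out_dense = self.dense(x)
        with torch.cuda.stream(s_shuffle):
            out_shuffle = self.shuffle(x)

        # The current stream waits for both side streams before using the outputs.
        current.wait_stream(s_dense)
        current.wait_stream(s_shuffle)
        # Tell the caching allocator that the outputs are also used on `current`.
        out_dense.record_stream(current)
        out_shuffle.record_stream(current)
        return out_dense, out_shuffle
```

It would replace `model_all(dense, shuffle)` in the script above and can be timed with the same loop. Creating the streams once and reusing them, rather than on every forward call, would avoid some per-call overhead.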