Hi, so I’ve been working on a model that combines a few different components, and while benchmarking I found some bizarre inconsistencies in the performance of the individual models.

All models are just basic ConvNets with BatchNorm and ReLU layers.

Specs:

- RTX 3090
- CUDA 11.4
- PyTorch 1.10.0
- cuDNN 8.2.0
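
These can be double-checked from inside Python:

```
import torch

print(torch.__version__)               # 1.10.0
print(torch.version.cuda)              # CUDA version PyTorch was built against
print(torch.backends.cudnn.version())  # 8200 -> cuDNN 8.2.0
print(torch.cuda.get_device_name(0))   # RTX 3090
```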

The initial setup looks something like this:

```
import torch

model_1 = Model_1().cuda()
model_2 = Model_2().cuda()
x = torch.rand(1, 5, 3, 416, 416, device='cuda')
```
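
For reference, here is a minimal stand-in for the models. This is only a simplified, hypothetical sketch (`Conv3d` is a guess to match the 5-D input; the real models follow the same Conv/BatchNorm/ReLU pattern); the important part is that `model_2` accepts both `x` and the output of `model_1`:

```
import torch
import torch.nn as nn

class Model_1(nn.Module):
    # Hypothetical stand-in: a shape-preserving Conv3d/BatchNorm3d/ReLU stack.
    def __init__(self, channels=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)

class Model_2(Model_1):
    # Hypothetical stand-in with the same structure as Model_1.
    pass
```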

Now for the weird part: the performance changes depending on which combination of the models I run.
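
All the timings below come from a plain wall-clock harness around each loop, roughly like this (a simplified sketch of the measurement, not the exact script):

```
import time

start = time.time()
for _ in range(100):
    model_2(x)
print(f"loop took {time.time() - start:.3f} seconds")
```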

**Example 1:**

```
for _ in range(100):
    model_2(model_1(x))
for _ in range(100):
    model_2(x)
# >> Second loop takes 0.053 seconds
```

**Example 2:**

```
for _ in range(100):
    model_2(x)
# >> This loop takes 0.47 seconds
```

**Example 3:**

```
for _ in range(100):
    model_2(x)
for _ in range(100):
    model_2(x)
# >> Second loop takes 2.5 seconds
```

**Example 4:**

```
for _ in range(100):
    model_1(x)
# >> This loop takes 1.72 seconds
for _ in range(100):
    model_2(x)
# >> This loop takes 0.01 seconds
for _ in range(100):
    model_1(x)
    model_2(x)
# >> This loop takes 8 seconds
```

**Example 5:**

```
for _ in range(100):
    model_1(x)
    model_2(x)
# >> This loop takes 5 seconds
```

I have no idea what is going on. Why is the combined loop ~1.6x slower in **Example 4** (8 seconds) than in **Example 5** (5 seconds), when the only difference between them is the two initial loops?

Why does *model_2* take ~0.47 seconds for 100 iterations on its own (**Example 2**), but only ~0.05 seconds when preceded by a loop running `model_2(model_1(x))` (**Example 1**)? And why does preceding it with a plain `model_2(x)` loop instead (**Example 3**) make it ~5x slower?

What is going on here? How do I mitigate the issue?