Hi, so I’ve been working on a model that combines a few different components, and while benchmarking I found some bizarre inconsistencies in the performance of the individual models.

All models are just basic ConvNets with BatchNorm and ReLU layers.

Specs:

- RTX 3090
- CUDA 11.4
- PyTorch 1.10.0
- cuDNN 8.2.0
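
These can be double-checked from inside Python:

```
import torch

print(torch.__version__)               # 1.10.0
print(torch.version.cuda)              # CUDA version PyTorch was built against
print(torch.backends.cudnn.version())  # 8200 -> cuDNN 8.2.0
print(torch.cuda.get_device_name(0))   # RTX 3090
```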

The initial setup looks something like this:

```
import torch

model_1 = Model_1().cuda()
model_2 = Model_2().cuda()
x = torch.rand(1, 5, 3, 416, 416, device='cuda')
```
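
For reference, here is a minimal stand-in for the models. This is only a simplified, hypothetical sketch (`Conv3d` is a guess to match the 5-D input; the real models follow the same Conv/BatchNorm/ReLU pattern); the important part is that `model_2` accepts both `x` and the output of `model_1`:

```
import torch
import torch.nn as nn

class Model_1(nn.Module):
    # Hypothetical stand-in: a shape-preserving Conv3d/BatchNorm3d/ReLU stack.
    def __init__(self, channels=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)

class Model_2(Model_1):
    # Hypothetical stand-in with the same structure as Model_1.
    pass
```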

Now for the weird part: the performance changes depending on which combination of the models I run.
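
All the timings below come from a plain wall-clock harness around each loop, roughly like this (a simplified sketch of the measurement, not the exact script):

```
import time

start = time.time()
for _ in range(100):
    model_2(x)
print(f"loop took {time.time() - start:.3f} seconds")
```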

**Example 1:**

```
for _ in range(100):
    model_2(model_1(x))
for _ in range(100):
    model_2(x)
# >> Second loop takes 0.053 seconds
```

**Example 2:**

```
for _ in range(100):
    model_2(x)
# >> This loop takes 0.47 seconds
```

**Example 3:**

```
for _ in range(100):
    model_2(x)
for _ in range(100):
    model_2(x)
# >> Second loop takes 2.5 seconds
```

**Example 4:**

```
for _ in range(100):
    model_1(x)
# >> This loop takes 1.72 seconds
for _ in range(100):
    model_2(x)
# >> This loop takes 0.01 seconds
for _ in range(100):
    model_1(x)
    model_2(x)
# >> This loop takes 8 seconds
```

**Example 5:**

```
for _ in range(100):
    model_1(x)
    model_2(x)
# >> This loop takes 5 seconds
```

I have no idea what is going on. Why is the combined loop ~1.6x slower in **Example 4** (8 seconds) than in **Example 5** (5 seconds), when the only difference between them is the two initial loops?

Why does *model_2* take ~0.47 seconds for 100 iterations on its own (**Example 2**), but only ~0.05 seconds when preceded by a loop running `model_2(model_1(x))` (**Example 1**)? And why does preceding it with a plain `model_2(x)` loop instead (**Example 3**) make it ~5x slower?

What is going on here? How do I mitigate the issue?