Weird unexpected performance variability

rmd · November 27, 2021, 10:01pm

Hi, so I’ve been working on a model that uses a few different components and when benchmarking found some bizarre inconsistencies with the performance of the various models.

All models are just basic ConvNets with BatchNorm and ReLU layers.

Specs:

RTX 3090
CUDA 11.4
PyTorch 1.10.0
cuDNN 8.2.0

The initial setup looks somewhat like this:

model_1 = Model_1().cuda()
model_2 = Model_2().cuda()

x = torch.rand(1, 5, 3, 416, 416, device='cuda')

Now, the weird part is the performance issue that I experience with different combinations of the models.

Example 1:

for _ in range(100):
    model_2(model_1(x))

for _ in range(100):
    model_2(x)
# >> Second loop takes 0.053 seconds

Example 2:

for _ in range(100):
    model_2(x)
# >> This loop takes 0.47 seconds

Example 3:

for _ in range(100):
    model_2(x)

for _ in range(100):
    model_2(x)
# >> Second loop takes 2.5 seconds

Example 4:

for _ in range(100):
    model_1(x)
# >> This loop takes 1.72 seconds

for _ in range(100):
   model_2(x)
# >> This loop takes 0.01 seconds

for _ in range(100):
    model_1(x)
    model_2(x)
# >> This loop takes 8 seconds

Example 5:

for _ in range(100):
    model_1(x)
    model_2(x)
# >> This loop takes 5 seconds

I have no idea what is going on. Why am I experiencing an ~1.6x reduction in performance between Example 4 and Example 5 by just removing the two initial loops.

Why does model_2 take ~0.47 seconds to loop 100 times alone (Example 2) but when I add a loop before with model_2(model_1(x)) (Example 1) it takes ~0.05 seconds to loop 100 times alone. But then adding a loop with model_2(x) (Example 3) reduces the performance by ~5x.

What is going on here? How do I mitigate the issue?

ptrblck · November 29, 2021, 7:11am

CUDA operations are executed asynchronously so to you would need to synchronize the code via torch.cuda.synchronize() before starting and stopping the timers.
Otherwise you would profile the kernel launches, will see accumulated times in blocking operations etc., which seems to be the case here.