VGG16 faster at inference than MobileNet V3 Small on GPU

Hi.
I’m running inference with VGG16 and MobileNet V3 Small on Google Colab using an NVIDIA Tesla T4 GPU, with PyTorch 2.2.1 and cuDNN 8906 (i.e. 8.9.6). When measuring the average inference time per image with the code below, I got the following results:
VGG16: 14.38 ms per image
MobileNet V3 Small: 17.39 ms per image
So VGG16 is faster than MobileNet V3 Small, even though MobileNet is more efficient and has far fewer GFLOPs than VGG16 (0.06 for MobileNet vs. 15.47 for VGG16).
However, when inference is run on the CPU, MobileNet is much faster than VGG16: 24 ms vs. 906 ms per image, respectively.
How can this be explained?

The batch size used is 1.
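For reference, the timing code below assumes that model, device and test_dataloader already exist. A minimal sketch of that setup, assuming the standard torchvision model definitions (the dataloader itself is a custom dataset that also returns the image path):

import torch
from torchvision import models

device = torch.device('cuda')

# Either model under test, assuming the stock torchvision definitions
model = models.vgg16(weights='IMAGENET1K_V1')
# model = models.mobilenet_v3_small(weights='IMAGENET1K_V1')
model = model.to(device).eval()

# test_datasets / test_dataloader: custom dataset returning (image, label, path), batch_size=1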

import numpy as np
import torch
from tqdm import tqdm

dummy_input = torch.randn(1, 3, 256, 256, dtype=torch.float).to(device)
starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
timings = np.zeros((len(test_datasets), 1))

# GPU warm-up
for _ in range(10):
    _ = model(dummy_input)

# Measure performance
loop = tqdm(test_dataloader)  # progress bar
rep = 0
for inputs, labels, paths in loop:
    inputs = inputs.to(device)
    labels = labels.to(device)
    starter.record()
    outputs = model(inputs)
    ender.record()
    # Wait for the GPU to finish before reading the event timers
    torch.cuda.synchronize()
    curr_time = starter.elapsed_time(ender)  # milliseconds
    timings[rep] = curr_time
    rep = rep + 1

mean_syn = np.sum(timings) / len(test_datasets)
std_syn = np.std(timings)
print('\nAverage time per image : ', mean_syn)
print('Total inference time : ', np.sum(timings))
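As a cross-check on the methodology (not part of the run above), the same loop can be timed with the host clock instead of CUDA events; torch.cuda.synchronize() has to be called before reading the clock, and this variant also includes the host-to-device copies:

import time

torch.cuda.synchronize()
t0 = time.perf_counter()
with torch.no_grad():
    for inputs, labels, paths in test_dataloader:
        inputs = inputs.to(device)
        _ = model(inputs)
torch.cuda.synchronize()  # make sure all queued GPU work has finished
t1 = time.perf_counter()
print('Average time per image (host clock):', 1000 * (t1 - t0) / len(test_datasets), 'ms')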

Different optimizations can be used under the hood, e.g. via cuDNN. You could disable it and profile the code again using the fallback kernels.
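Disabling cuDNN globally is a one-liner:

import torch
torch.backends.cudnn.enabled = False  # fall back to PyTorch's native kernels instead of cuDNN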

Hi. Thank you for your reply.

If I understand correctly, I should disable cuDNN and run the same code as before. I did this on the same GPU, and this is what I got:

VGG16: 20.9 ms per image (14.38 ms with cuDNN),
MobileNet V3 Small: 21.81 ms per image (17.39 ms with cuDNN).

Both models are slower when cuDNN is disabled, and VGG16 is still faster than MobileNet V3 Small, although the gap between the two inference times is smaller.

Are there any other optimizations I need to disable?

Thanks!!

You could profile the code with e.g. Nsight Systems and check the visual timelines of both models to see where their bottlenecks are and which kernels are used. The profile might also show that your use case is CPU-limited, e.g. if you see a lot of whitespace (idle gaps) between kernels.
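If installing Nsight Systems on Colab is inconvenient, torch.profiler gives a rough in-Python view of the per-kernel breakdown; a minimal sketch, reusing the dummy_input from the timing code above:

import torch
from torch.profiler import profile, ProfilerActivity

with torch.no_grad(), profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(50):
        _ = model(dummy_input)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))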