Inference time and TFLOPS

I am currently measuring the half-precision inference time of different CNN models with torch.autograd.profiler on two different GPUs:

  1. Nvidia RTX 2080 Ti (26.90 TFLOPS) - done locally (better CPU)
  2. Nvidia T4 (65.13 TFLOPS) - done in the cloud

I was surprised to find that the 2080 Ti is significantly faster (half the time or less), independent of batch size, input resolution, and architecture, even though it has less than half the TFLOPS.

Does anyone know why?

import torch
import segmentation_models_pytorch as smp # pip install git+

runs = 10
res = 512
bs = 8
is_half = True

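m = smp.Unet(encoder_name='resnet101', encoder_weights=None)
m = m.cuda().eval()  # move the model to the GPU and switch to inference mode

t = torch.rand((bs, 3, res, res)).cuda()

if is_half:
    # cast both the model and the input to fp16
    m = m.half()
    t = t.half()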

# warm up (the first forward pass includes one-time CUDA/cuDNN setup cost)
with torch.no_grad():
    m(t)

cpu_time_ms = 0
cuda_time_ms = 0
for i in range(runs):
    with torch.no_grad():
        with torch.autograd.profiler.profile(use_cuda=True) as prof:
            m(t)
    # profiler totals are in microseconds; convert to milliseconds
    cpu_time_ms += prof.self_cpu_time_total / 1000
    cuda_time_ms += sum([evt.cuda_time_total for evt in prof.key_averages()]) / 1000

# average over runs and batch size, i.e. time per image
cpu_time_ms /= runs * bs
cuda_time_ms /= runs * bs

print('res={}x{} cuda={:.1f}ms cpu={:.1f}ms'.format(res, res, cuda_time_ms, cpu_time_ms))
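
As a cross-check on the profiler numbers, the forward pass could also be timed with a plain wall-clock timer. This is a minimal sketch and not part of the script above: `benchmark` and its parameters `fn`, `runs`, `warmup`, and `sync` are names made up for illustration. CUDA kernels are launched asynchronously, so for GPU work `sync` should be something like `torch.cuda.synchronize`; otherwise the clock may stop before the queued kernels have finished.

```python
import time


def benchmark(fn, runs=10, warmup=1, sync=None):
    """Return the mean wall-clock time of fn() in milliseconds.

    sync is an optional zero-argument callable that blocks until all
    pending work has finished (hypothetical hook; for CUDA this would
    be torch.cuda.synchronize).
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure warm-up work is done before starting the clock
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    if sync is not None:
        sync()  # wait for queued work before stopping the clock
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / runs
```

With the script above, something like `benchmark(lambda: m(t), runs=runs, sync=torch.cuda.synchronize) / bs`, called inside `torch.no_grad()`, should give a per-image time in ms that can be compared with the profiler's cuda figure.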


unet with resnet101 as backbone and batch size 8

t4
res=128x128 cuda=11.3ms cpu=3.0ms
res=256x256 cuda=14.5ms cpu=2.8ms
res=512x512 cuda=50.4ms cpu=7.3ms

rtx 2080 ti
res=128x128 cuda=7.5ms cpu=1.7ms
res=256x256 cuda=8.6ms cpu=1.8ms
res=512x512 cuda=21.1ms cpu=3.0ms
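
To put the gap in numbers, the measured times can be compared with the peak TFLOPS figures quoted at the top. This is only a rough sketch of the arithmetic: it assumes the first block of results is from the T4, and the variable and function names are made up for illustration.

```python
# Per-image CUDA times in ms, copied from the results above.
# Assumption: the first block of results is from the T4.
t4_ms = {128: 11.3, 256: 14.5, 512: 50.4}
rtx_ms = {128: 7.5, 256: 8.6, 512: 21.1}

# Peak half-precision TFLOPS as quoted at the top of the post.
t4_tflops = 65.13
rtx_tflops = 26.90


def slowdown(res):
    """How many times slower the T4 is than the 2080 Ti at this resolution."""
    return t4_ms[res] / rtx_ms[res]


# How much faster the T4 would be if peak TFLOPS were the only factor.
tflops_ratio = t4_tflops / rtx_tflops

print('peak TFLOPS ratio T4 / 2080 Ti: {:.2f}'.format(tflops_ratio))
for res in sorted(t4_ms):
    print('res={}x{} T4 / 2080 Ti time: {:.2f}x'.format(res, res, slowdown(res)))
```

Going by peak TFLOPS alone, the T4 should be about 2.4x faster, but in these measurements it is about 1.5x to 2.4x slower.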