Discrepancy in CPU and GPU inference times

Hi, I have model A (a FlowNet-based model) and model B (a PWC-Net-based model). The inference time of model B is lower than that of model A on GPU, but on CPU the relation is reversed. I would expect that a model which is faster on GPU would also be faster on CPU, but this doesn't seem to be the case. Can someone please let me know if this is normal and if there are any workarounds to mitigate the issue?

Thanks.

How did you measure the performance for your CPU and GPU runs?
Did you properly synchronize the code for the GPU run before starting and stopping the timer?

If so, your scripts might face other unrelated bottlenecks, which would distort the timing and should thus be removed (e.g. remove the data loading and feed random input tensors instead).
Also, setting torch.backends.cudnn.benchmark = True will make cuDNN profile the available kernels in the first iteration for each new input shape and select the fastest one, which might yield a speedup in subsequent iterations.
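As a minimal sketch (net and x are hypothetical names for a model and a GPU input tensor, and the iteration count is arbitrary):

import torch

# let cuDNN benchmark the available kernels for each new input shape
torch.backends.cudnn.benchmark = True

with torch.no_grad():
    # the first iterations per shape are slower while cuDNN profiles kernels
    for _ in range(10):
        _ = net(x)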

If all this was already done, then your model might launch a lot of kernels with tiny workloads, so that the overall GPU runtime is dominated by the kernel launch overhead.
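One way to check for this is the built-in profiler; a sketch, assuming net and frames are already on the GPU (the sort key and row limit are just one choice):

import torch
from torch.profiler import profile, ProfilerActivity

# profile a single forward pass; a long list of very short CUDA kernels
# in the output would point to launch overhead dominating the runtime
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        _ = net(frames)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))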

Thanks for the answer. I am not sure what you meant by synchronizing the code before the GPU run. Can you please elaborate? I just timed the inference on a single sample as follows:

import time
import torch

# Model, device and frames are defined earlier in the script
net = Model().to(device).eval()

t1 = time.time()
flow = net(frames)
t2 = time.time()
print("Inference time =", t2 - t1)

There is no data loading in the timed part of the code as frames is a preloaded tensor. I also tried adding torch.backends.cudnn.benchmark = True as you suggested, but it does not result in any significant change in the inference time in either case.

Here is a code snippet showing how to synchronize the code before starting and stopping the timer, as well as how to use some warmup iterations to get more stable results.
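A minimal sketch for the GPU case (Model, device, and frames are placeholders for your own setup, and the iteration counts are arbitrary):

import time
import torch

net = Model().to(device).eval()

with torch.no_grad():
    # warmup iterations to let cuDNN profile kernels and to avoid
    # measuring one-time startup costs
    for _ in range(10):
        _ = net(frames)

    # wait for all pending GPU work before starting the timer
    torch.cuda.synchronize()
    t1 = time.time()
    for _ in range(100):
        _ = net(frames)
    # wait for the GPU to finish before stopping the timer
    torch.cuda.synchronize()
    t2 = time.time()

print("Inference time = {:.3f} ms per iteration".format((t2 - t1) / 100 * 1000))

Without the synchronize() calls, time.time() only measures how long it takes the CPU to enqueue the kernels, since CUDA operations are executed asynchronously.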

If you are using cudnn.benchmark = True, the first iteration for each new input shape will be slower, since cuDNN profiles different kernels and selects the fastest one, so the warmup iterations are necessary in that case.