GPU vs CPU multiple inferences

Hello,

I have a question regarding running inference on a model loaded in memory.
I initialize the model, enable .benchmark (if on the GPU), call .eval(), and then run inference 2 times; the setup is roughly the sketch below.
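
The model and input in this sketch are just placeholders for the real ones:

import time

import torch
import torchvision

# placeholder model and input; the real ones are already loaded in memory
model = torchvision.models.resnet18()
x = torch.randn(1, 3, 224, 224)

if torch.cuda.is_available():
    torch.backends.cudnn.benchmark = True  # the ".benchmark" step mentioned above
    model = model.cuda()
    x = x.cuda()

model.eval()

with torch.no_grad():
    for i in range(2):
        t0 = time.time()
        model(x)
        print(f"inference {i + 1}: {time.time() - t0:.2f} sec")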
Time: GPU_1 = 5sec
Time: GPU_2 = 0.05sec

Time: CPU_1 = 4sec
Time: CPU_2 = 4sec

In TensorFlow, using tf.Session, the second CPU run (CPU_2) also decreases by a large margin.
Is there anything I can do to “cache” an inference model on the CPU the way the GPU seems to?

Thanks in advance,
John

If you are timing GPU operations, note that CUDA calls are asynchronous, so you have to add synchronization points before starting and stopping the timer:

import time
import torch

torch.cuda.synchronize()  # make sure all pending GPU work is done before starting the timer
t0 = time.time()
# your operations
torch.cuda.synchronize()  # wait for the timed GPU work to actually finish
t1 = time.time()

Most likely you are measuring the CUDA initialization time etc. in GPU_1.
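
One common pattern (just a sketch, with a placeholder model and input) is to run a warm-up pass first, so that the one-time setup cost is not attributed to the first timed inference:

import time
import torch
import torchvision

# placeholder model and input, just for illustration
model = torchvision.models.resnet18().cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")

with torch.no_grad():
    model(x)                   # warm-up: triggers CUDA init, cuDNN algorithm selection, etc.
    torch.cuda.synchronize()

    t0 = time.time()
    model(x)                   # the inference you actually want to time
    torch.cuda.synchronize()   # wait for the GPU to finish before stopping the timer
    t1 = time.time()

print(f"timed inference: {t1 - t0:.4f} sec")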