GPU vs CPU multiple inferences


I have a question regarding running inference on a model loaded in memory.
I initialize the model, enable cudnn benchmark mode (if on the GPU), call .eval(), and then run inference twice.
Time: GPU_1 = 5sec
Time: GPU_2 = 0.05sec

Time: CPU_1 = 4sec
Time: CPU_2 = 4sec

Now, in TensorFlow using tf.Session, CPU_2 is decreased by a large margin as well.
Is there anything I can do to “cache” an inference model on the CPU the way the GPU does?

Thanks in advance,

If you are timing GPU operations, note that CUDA calls are asynchronous, so you have to add synchronization points before starting and stopping the timer:

torch.cuda.synchronize()  # wait for all pending kernels before starting the timer
t0 = time.time()
# your operations
torch.cuda.synchronize()  # wait for the operations to actually finish
t1 = time.time()

Most likely you are measuring one-time startup costs in GPU_1: CUDA context initialization, cudnn benchmarking (since you enabled it), memory allocation, etc.
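Putting both points together, a minimal timing sketch could look like the following. The model and input here are placeholders for illustration (a plain nn.Linear); the key parts are the warm-up forward pass, which absorbs the one-time startup costs, and the synchronize calls around the timed region:

```python
import time
import torch
import torch.nn as nn

# Hypothetical model and input, just for illustration; substitute your own.
model = nn.Linear(1024, 1024).eval()
x = torch.randn(64, 1024)

use_cuda = torch.cuda.is_available()
if use_cuda:
    torch.backends.cudnn.benchmark = True
    model = model.cuda()
    x = x.cuda()

with torch.no_grad():
    # Warm-up run: pays the one-time costs (CUDA context creation,
    # cudnn algorithm selection, memory allocation) outside the timer.
    model(x)

    if use_cuda:
        torch.cuda.synchronize()  # drain queued kernels before timing
    t0 = time.time()
    model(x)
    if use_cuda:
        torch.cuda.synchronize()  # make sure the forward pass finished
    t1 = time.time()

print(f"forward pass: {(t1 - t0) * 1000:.2f} ms")
```

With this setup the first (warm-up) call can still be slow, but the timed call measures only the forward pass itself, on both CPU and GPU.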