GPU vs CPU multiple inferences

Hello,

I have a question regarding running inference on a model loaded in memory.
I initialize the model, enable .benchmark (if on the GPU), call .eval(), and then run inference 2 times; the setup is roughly the sketch below.
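
The model and input in this sketch are just placeholders for the real ones:

import time

import torch
import torchvision

# placeholder model and input; the real ones are already loaded in memory
model = torchvision.models.resnet18()
x = torch.randn(1, 3, 224, 224)

if torch.cuda.is_available():
    torch.backends.cudnn.benchmark = True  # the ".benchmark" step mentioned above
    model = model.cuda()
    x = x.cuda()

model.eval()

with torch.no_grad():
    for i in range(2):
        t0 = time.time()
        model(x)
        print(f"inference {i + 1}: {time.time() - t0:.2f} sec")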
Time: GPU_1 = 5sec
Time: GPU_2 = 0.05sec

Time: CPU_1 = 4sec
Time: CPU_2 = 4sec

In TensorFlow, using tf.Session, the second CPU run (CPU_2) also decreases by a large margin.
Is there anything I can do to “cache” an inference model on the CPU the way the GPU seems to?

Thanks in advance,
John

If you are timing GPU operations, note that CUDA calls are asynchronous, so you have to add synchronization points before starting and stopping the timer:

import time
import torch

torch.cuda.synchronize()  # make sure all pending GPU work is done before starting the timer
t0 = time.time()
# your operations
torch.cuda.synchronize()  # wait for the timed GPU work to actually finish
t1 = time.time()

Most likely you are measuring the CUDA initialization time etc. in GPU_1.
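
One common pattern (just a sketch, with a placeholder model and input) is to run a warm-up pass first, so that the one-time setup cost is not attributed to the first timed inference:

import time
import torch
import torchvision

# placeholder model and input, just for illustration
model = torchvision.models.resnet18().cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")

with torch.no_grad():
    model(x)                   # warm-up: triggers CUDA init, cuDNN algorithm selection, etc.
    torch.cuda.synchronize()

    t0 = time.time()
    model(x)                   # the inference you actually want to time
    torch.cuda.synchronize()   # wait for the GPU to finish before stopping the timer
    t1 = time.time()

print(f"timed inference: {t1 - t0:.4f} sec")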