Model idle time on GPU

Initial situation: I have written an API to serve my object detection models. The model is loaded onto the GPU, and inferences are then executed against it. However, if there is a break between requests (say, five minutes), the first inference after the break takes significantly longer.

My question: Are there certain parameters to prevent this? What is happening with the model on the GPU?

Your GPU might drop into an idle (low-power) state, and waking it up again can take some time. You could enable the persistence daemon as described in the NVIDIA docs:

This approach would prevent the kernel module from fully unloading software and hardware state when no user software was using the GPU.