Initial situation: I have written an API to serve my object detection models. The model is loaded onto the GPU once at startup, and inference requests are then executed against it. However, if there is a pause between requests (say, 5 minutes), the first inference after the pause takes significantly longer than usual.
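For reference, here is a minimal sketch of the setup. This is an assumption about the stack, not my exact code: it uses PyTorch, torchvision, and FastAPI, with a torchvision detector standing in for my actual models, and the `/detect` endpoint is a placeholder.

```python
import torch
import torchvision
from fastapi import FastAPI

app = FastAPI()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for my actual detector (assumption: any detection model works here).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.to(device).eval()

@app.post("/detect")
def detect() -> dict:
    # Stand-in input; the real endpoint decodes an uploaded image.
    x = [torch.rand(3, 640, 640, device=device)]
    with torch.no_grad():
        preds = model(x)
    # This call is fast when requests arrive continuously, but the first
    # request after ~5 idle minutes is noticeably slower.
    return {"num_detections": len(preds[0]["boxes"])}
```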
My question: Are there parameters or settings that prevent this slowdown? What happens to the model on the GPU while the service is idle?