YOLOv7 inference with libtorch C++

I am using libtorch to perform inference with a TorchScripted YOLOv7 model. The issue I'm facing is that the second inference call (not the first one) takes much longer, around 20 seconds. After that, the inference time drops to about 20 milliseconds.
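For context, my loop looks roughly like this (the model path and the 1x3x640x640 input shape are simplified placeholders):

```cpp
#include <torch/script.h>
#include <torch/torch.h>
#include <torch/cuda.h>
#include <chrono>
#include <iostream>

int main() {
    // Placeholder path/shape: the real model is a TorchScripted YOLOv7 export.
    torch::jit::script::Module model = torch::jit::load("yolov7.torchscript.pt");
    model.to(torch::kCUDA);
    model.eval();

    torch::NoGradGuard no_grad;
    auto input = torch::rand({1, 3, 640, 640}, torch::kCUDA);

    for (int i = 0; i < 5; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        auto out = model.forward({input});
        torch::cuda::synchronize();  // wait for the GPU before stopping the clock
        auto t1 = std::chrono::steady_clock::now();
        std::cout << "call " << i << ": "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
                  << " ms\n";
    }
    return 0;
}
```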

Upon profiling, I discovered that during the 20-second delay there are numerous cuModuleLoadData API calls, which suggests that Torch is loading and compiling kernels at that point.

When I set torch::jit::setGraphExecutorOptimize(false), the warm-up time dropped to 400 milliseconds, but the steady-state inference time increased to 40 milliseconds. (Inference times below 30 milliseconds are crucial in my case.)
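For reference, I set the flag once before the first forward() call, roughly like this (the explicit graph_executor.h include is just in case it isn't pulled in transitively):

```cpp
#include <torch/script.h>
// torch::jit::setGraphExecutorOptimize is declared in the JIT runtime headers.
#include <torch/csrc/jit/runtime/graph_executor.h>

int main() {
    // Disable the profiling executor's specialization/fusion before any
    // forward() call: warm-up drops (~400 ms here) but steady-state latency
    // roughly doubles (~40 ms) because kernels are no longer fused.
    torch::jit::setGraphExecutorOptimize(false);

    torch::jit::script::Module model = torch::jit::load("yolov7.torchscript.pt");
    model.to(torch::kCUDA);
    model.eval();
    return 0;
}
```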

I understand that enabling GraphExecutorOptimize fuses certain kernels and creates input-size-specific kernels to speed up inference, which is why the inference time roughly doubles when the optimization is disabled.

My question is: do you know how to store the fused and optimized kernels to avoid excessive warm-up time?

I am aware that I could run fake inferences with random inputs to warm up the model. However, my goal is to run inference on my actual inputs as quickly as possible, without waiting for a warm-up with fake data. In other words, I cannot afford the warm-up process, since it delays the first real frames just as much as paying the cost on the first or second real frame would.
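For clarity, this is the kind of warm-up I mean (and want to avoid); the 1x3x640x640 shape is just an example and has to match the real input shape for the specialization to help:

```cpp
#include <torch/script.h>
#include <torch/torch.h>
#include <torch/cuda.h>

// Run a few dummy forward passes at load time so the fused,
// shape-specialized kernels are built before the first real frame arrives.
void warm_up(torch::jit::script::Module& model, int iters = 3) {
    torch::NoGradGuard no_grad;
    auto dummy = torch::zeros({1, 3, 640, 640}, torch::kCUDA);
    for (int i = 0; i < iters; ++i) {
        model.forward({dummy});
    }
    torch::cuda::synchronize();  // make sure compilation/launches have finished
}
```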


If your input shape is static, you could consider CUDA graphs for AOT compilation.
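A minimal sketch of what that could look like with libtorch's at::cuda::CUDAGraph (the model path and the 1x3x640x640 shape are placeholders, and a few warm-up iterations on the capture stream are still needed once before capturing):

```cpp
#include <torch/script.h>
#include <torch/torch.h>
#include <torch/cuda.h>
#include <ATen/cuda/CUDAGraph.h>
#include <c10/cuda/CUDAStream.h>
#include <c10/cuda/CUDAGuard.h>

int main() {
    torch::jit::script::Module model = torch::jit::load("yolov7.torchscript.pt");
    model.to(torch::kCUDA);
    model.eval();
    torch::NoGradGuard no_grad;

    auto static_input = torch::zeros({1, 3, 640, 640}, torch::kCUDA);
    c10::IValue static_output;

    at::cuda::CUDAGraph graph;
    auto stream = c10::cuda::getStreamFromPool();
    {
        // Capture must happen on a non-default stream; warm up on that stream
        // first so the JIT has finished specializing/fusing before capture.
        c10::cuda::CUDAStreamGuard guard(stream);
        for (int i = 0; i < 3; ++i) {
            static_output = model.forward({static_input});
        }
        torch::cuda::synchronize();

        graph.capture_begin();
        static_output = model.forward({static_input});
        graph.capture_end();
    }

    // Per frame: copy the new (same-shaped) frame into the captured input
    // buffer and replay; the tensors inside static_output are overwritten.
    auto new_frame = torch::rand({1, 3, 640, 640}, torch::kCUDA);  // stand-in for a real frame
    static_input.copy_(new_frame);
    graph.replay();
    torch::cuda::synchronize();
    // static_output may be a tuple for YOLOv7; unpack it as you do today.
    return 0;
}
```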