Hi everyone, I’m seeing a consistent issue during GPU inference where the first batch becomes significantly slower after a short idle period, even when data is fully preloaded and H2D/D2H transfers are negligible. When I run workloads continuously (e.g., queueing datasets back-to-back), inference times stay stable (~13–17 ms), but if there’s a small gap between runs, the first batch of the next workload can spike dramatically (e.g., 13 ms → 160 ms), even with identical datasets and fixed batch sizes. This behavior is reproducible in both PyTorch and TensorRT, and profiling shows the delay occurs in the first GPU kernels (like GEMM), not in data loading. Using a queue and preloading reduces the issue, and it disappears entirely if execution remains continuous, which makes me suspect a GPU “cold start” effect after idling (e.g., power state changes, cuBLAS/cuDNN reinitialization, or kernel scheduling overhead). Is this expected behavior in PyTorch CUDA execution, and what’s the recommended way to mitigate it in production inference systems?
Does this occur any time a new process / CUDA context is created, or a new type of kernel is called, whenever a longer gap happens?
In that case, it could be due to lazy module loading (loading the kernel code itself into GPU memory on first use). You can disable lazy module loading by setting the environment variable CUDA_MODULE_LOADING=EAGER. This will cause a big spike when the context is created, but it should then mitigate the spikes during execution, so it lets you shift the loading cost to a point where it is easier for your app to handle.
Can you try this and check whether it makes any difference? If not, this is likely not related to module loading at all.
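If it helps, here is a minimal sketch of how you could try this from Python. The only thing taken from above is the CUDA_MODULE_LOADING=EAGER setting; the `warm_up` helper, its parameters, and the idea of running a few throwaway batches at startup are my own assumptions for illustration, not an official PyTorch recipe. The variable has to be in place before CUDA is initialized, so setting it before importing torch is the safe choice.

```python
import os

# Set before CUDA is initialized; doing it before importing torch is the safe choice.
os.environ["CUDA_MODULE_LOADING"] = "EAGER"


def warm_up(model, example_batch, iters=3):
    """Hypothetical helper: run a few throwaway batches at startup so that
    one-time costs are paid before real traffic arrives."""
    import torch  # imported here so the env var above is already in place

    with torch.inference_mode():
        for _ in range(iters):
            model(example_batch)
    if torch.cuda.is_available():
        # Wait for the queued GPU work to finish before returning.
        torch.cuda.synchronize()
```

You can also set the variable in the shell that launches the process instead of in code. If the spike after an idle gap is unchanged with this in place, that points away from module loading.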