How to minimize the gaps between kernels on the GPU?

I’m currently training multiple small models in parallel across several CUDA streams with LibTorch. However, each kernel execution is very short (only a few microseconds) and the kernels are scattered across the timeline, so they rarely overlap or run in parallel effectively. I’ve attached a screenshot showing how dispersed the kernel executions are.
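For reference, the per-model stream setup looks roughly like the sketch below (the `Linear` modules, shapes, and names are just placeholders for my real models):

```cpp
#include <torch/torch.h>
#include <c10/cuda/CUDAStream.h>
#include <c10/cuda/CUDAGuard.h>

int main() {
  // Two small placeholder models standing in for the real ones.
  torch::nn::Linear model_a(64, 64), model_b(64, 64);
  model_a->to(torch::kCUDA);
  model_b->to(torch::kCUDA);
  auto in_a = torch::randn({8, 64}, torch::kCUDA);
  auto in_b = torch::randn({8, 64}, torch::kCUDA);

  // One stream per model, taken from the stream pool.
  c10::cuda::CUDAStream s_a = c10::cuda::getStreamFromPool();
  c10::cuda::CUDAStream s_b = c10::cuda::getStreamFromPool();

  {
    c10::cuda::CUDAStreamGuard guard(s_a);  // kernels below are enqueued on s_a
    auto out_a = model_a->forward(in_a);
  }
  {
    c10::cuda::CUDAStreamGuard guard(s_b);  // kernels below are enqueued on s_b
    auto out_b = model_b->forward(in_b);
  }

  // Kernels on s_a and s_b can only overlap on the GPU if the CPU enqueues
  // them faster than the GPU retires them; otherwise launch gaps dominate.
  s_a.synchronize();
  s_b.synchronize();
  return 0;
}
```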

I’m wondering whether this is due to the overhead of launching each kernel. If so, is there any way to minimize the gaps between these kernels and improve parallelism? Any suggestions would be appreciated!

If your workload is dominated by kernel launch overhead, you could check whether CUDA Graphs work for your use case, either directly or via torch.compile.
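In LibTorch the capture API is exposed as `at::cuda::CUDAGraph`. A minimal forward-pass sketch could look like the following; the module, shapes, and warm-up count are placeholders, and capturing a full training step needs extra care (static input/gradient buffers, optimizer state):

```cpp
#include <torch/torch.h>
#include <ATen/cuda/CUDAGraph.h>
#include <c10/cuda/CUDAStream.h>
#include <c10/cuda/CUDAGuard.h>

int main() {
  // Placeholder model and static buffers; the graph replays against these
  // fixed addresses, so reuse them every iteration.
  torch::nn::Linear model(64, 64);
  model->to(torch::kCUDA);
  auto static_input = torch::randn({8, 64}, torch::kCUDA);
  torch::Tensor static_output;

  // Capture must happen on a non-default stream.
  c10::cuda::CUDAStream capture_stream = c10::cuda::getStreamFromPool();
  c10::cuda::CUDAStreamGuard guard(capture_stream);

  // Warm up before capture (lazy context / cuBLAS handle creation, etc.).
  for (int i = 0; i < 3; ++i) {
    static_output = model->forward(static_input);
  }
  capture_stream.synchronize();

  // Capture one forward pass into a graph.
  at::cuda::CUDAGraph graph;
  graph.capture_begin();
  static_output = model->forward(static_input);
  graph.capture_end();

  // Replay relaunches the whole captured kernel sequence with a single
  // CPU-side call, removing per-kernel launch overhead.
  static_input.copy_(torch::randn({8, 64}, torch::kCUDA));
  graph.replay();
  capture_stream.synchronize();
  return 0;
}
```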

Thanks for the help. I am using LibTorch; does it also have torch.compile? I couldn’t find any documentation for it in LibTorch.

No, I don’t think torch.compile or any part of its stack is exposed as a pure C++ API. You could try to torch.export your model as described here.
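If you go the torch.export + AOTInductor route, the compiled shared library can then be loaded from C++. As a rough sketch, assuming a recent LibTorch build (the header path, class name, and `"model.so"` path below are illustrative and vary between releases):

```cpp
#include <iostream>
#include <torch/torch.h>
#include <torch/csrc/inductor/aoti_runner/model_container_runner_cuda.h>

int main() {
  // "model.so" is the artifact produced on the Python side by
  // torch.export + AOTInductor; the path is a placeholder.
  torch::inductor::AOTIModelContainerRunnerCuda runner("model.so");

  std::vector<torch::Tensor> inputs = {torch::randn({8, 64}, torch::kCUDA)};
  std::vector<torch::Tensor> outputs = runner.run(inputs);

  std::cout << outputs[0].sizes() << std::endl;
  return 0;
}
```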