I’m currently training multiple small models on separate CUDA streams with LibTorch to run them in parallel. However, each kernel execution is very short (only a few microseconds) and the launches are scattered across the timeline, so the kernels rarely overlap or run in parallel effectively. I’ve attached a screenshot showing how dispersed the kernel executions are.
Is this caused by the overhead of launching each kernel? And if so, is there any way to minimize the gaps between these kernels to improve parallelism? Any suggestions would be appreciated!
If your workload suffers from kernel launch overhead, you could check whether CUDA Graphs work for your use case, either directly or via torch.compile.
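Since you are on LibTorch, the direct route would be capturing one iteration into a graph and then replaying it. Below is a minimal, forward-only sketch of the capture/replay pattern using at::cuda::CUDAGraph; the Linear model, tensor shapes, and warmup count are placeholders, and a real training step would need the same static-buffer treatment for gradients and optimizer state:

```cpp
// Sketch only (not your actual model): capture one forward pass into a CUDA
// graph and replay it, so the whole kernel sequence is issued with one launch.
#include <torch/torch.h>
#include <ATen/cuda/CUDAGraph.h>
#include <c10/cuda/CUDAStream.h>
#include <c10/cuda/CUDAGuard.h>

int main() {
  torch::nn::Linear model(128, 128);   // placeholder model
  model->to(torch::kCUDA);

  // Static input/output buffers: graph replay reuses the same memory
  // addresses, so new data must be copied into these tensors before replay.
  auto static_input = torch::randn({32, 128}, torch::kCUDA);
  torch::Tensor static_output;

  // Graph capture must run on a non-default stream.
  auto stream = c10::cuda::getStreamFromPool();
  c10::cuda::CUDAStreamGuard guard(stream);

  // Warm up first so lazy initialization (cuBLAS handles, etc.) does not
  // end up inside the capture.
  for (int i = 0; i < 3; ++i) {
    static_output = model->forward(static_input);
  }
  torch::cuda::synchronize();

  // Capture a single forward pass.
  at::cuda::CUDAGraph graph;
  graph.capture_begin();
  static_output = model->forward(static_input);
  graph.capture_end();

  // Replay: copy fresh data into the static input, then relaunch the whole
  // captured kernel sequence with a single call.
  static_input.copy_(torch::randn({32, 128}, torch::kCUDA));
  graph.replay();
  torch::cuda::synchronize();

  return 0;
}
```

The point is that replay reissues all captured kernels at once, so the per-kernel launch gaps you see in the timeline should mostly disappear; the trade-off is that shapes and tensor addresses must stay fixed between replays. On the Python side, torch.compile(mode="reduce-overhead") would apply CUDA Graphs for you.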