I’m currently training multiple small models on separate CUDA streams with LibTorch to run them in parallel. However, each kernel execution is very short (only a few microseconds) and the launches are scattered across the timeline, so the kernels rarely overlap or run in parallel effectively. I’ve attached a screenshot showing how dispersed the kernel executions are.
Is this caused by the overhead of launching each kernel? And if so, is there any way to minimize the gaps between these kernels to improve parallelism? Any suggestions would be appreciated!
If your workload suffers from kernel launch overhead, you could check whether CUDA Graphs work for your use case, either directly or via torch.compile.
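Since you are on LibTorch, the direct route would be capturing one iteration into a graph and then replaying it. Below is a minimal, forward-only sketch of the capture/replay pattern using at::cuda::CUDAGraph; the Linear model, tensor shapes, and warmup count are placeholders, and a real training step would need the same static-buffer treatment for gradients and optimizer state:

```cpp
// Sketch only (not your actual model): capture one forward pass into a CUDA
// graph and replay it, so the whole kernel sequence is issued with one launch.
#include <torch/torch.h>
#include <ATen/cuda/CUDAGraph.h>
#include <c10/cuda/CUDAStream.h>
#include <c10/cuda/CUDAGuard.h>

int main() {
  torch::nn::Linear model(128, 128);   // placeholder model
  model->to(torch::kCUDA);

  // Static input/output buffers: graph replay reuses the same memory
  // addresses, so new data must be copied into these tensors before replay.
  auto static_input = torch::randn({32, 128}, torch::kCUDA);
  torch::Tensor static_output;

  // Graph capture must run on a non-default stream.
  auto stream = c10::cuda::getStreamFromPool();
  c10::cuda::CUDAStreamGuard guard(stream);

  // Warm up first so lazy initialization (cuBLAS handles, etc.) does not
  // end up inside the capture.
  for (int i = 0; i < 3; ++i) {
    static_output = model->forward(static_input);
  }
  torch::cuda::synchronize();

  // Capture a single forward pass.
  at::cuda::CUDAGraph graph;
  graph.capture_begin();
  static_output = model->forward(static_input);
  graph.capture_end();

  // Replay: copy fresh data into the static input, then relaunch the whole
  // captured kernel sequence with a single call.
  static_input.copy_(torch::randn({32, 128}, torch::kCUDA));
  graph.replay();
  torch::cuda::synchronize();

  return 0;
}
```

The point is that replay reissues all captured kernels at once, so the per-kernel launch gaps you see in the timeline should mostly disappear; the trade-off is that shapes and tensor addresses must stay fixed between replays. On the Python side, torch.compile(mode="reduce-overhead") would apply CUDA Graphs for you.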