Is it possible to call individual CUDA kernels directly in PyTorch?


I am currently benchmarking the performance of CUDA kernels that are widely used in certain scientific PyTorch models, and I want to know if it's possible to invoke these kernels directly from a PyTorch script.

For example, I have identified a handful of target kernels for benchmarking (say `aten::native_batch_norm` and `aten::conv2d`) and now want to benchmark each kernel across different input sizes, data sparsities, GPUs, etc.

Is there a way I can access these functions directly in a PyTorch script? If not, are there any established workflows for benchmarking these kernels without falling back to pure CUDA C++?

Thanks in advance!

If you want a way to call TorchScript ops from Python under their TorchScript name, you can use `torch.ops`, e.g. `torch.ops.aten.native_batch_norm`. Note that this still goes through the various "accounting" layers between Python and C++ as well as the PyTorch dispatcher. The dispatcher overhead can be reduced by using inference mode, but that is about it. Depending on your input sizes, this overhead on top of the kernels themselves may be large or small.
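As a minimal sketch of what this looks like in practice (the tensor shapes here are arbitrary, chosen only for illustration): call the op under its `torch.ops.aten` name inside `torch.inference_mode()`, and use `torch.utils.benchmark.Timer`, which handles warmup and CUDA synchronization for you.

```python
import torch
import torch.utils.benchmark as benchmark

# Arbitrary example shapes; swap in the sizes you want to benchmark.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(32, 64, 56, 56, device=device)
weight = torch.ones(64, device=device)
bias = torch.zeros(64, device=device)
running_mean = torch.zeros(64, device=device)
running_var = torch.ones(64, device=device)

# Call the op under its TorchScript name. In eval mode it returns
# (output, save_mean, save_invstd); the last two are unused here.
with torch.inference_mode():
    out, _, _ = torch.ops.aten.native_batch_norm(
        x, weight, bias, running_mean, running_var,
        False,   # training
        0.1,     # momentum
        1e-5,    # eps
    )

# Timer runs the statement repeatedly and reports per-call timing,
# synchronizing the GPU around measurements when CUDA tensors are used.
timer = benchmark.Timer(
    stmt="torch.ops.aten.native_batch_norm(x, w, b, rm, rv, False, 0.1, 1e-5)",
    globals={"torch": torch, "x": x, "w": weight, "b": bias,
             "rm": running_mean, "rv": running_var},
)
print(timer.timeit(100))
```

Keep in mind the measured time includes the Python-to-C++ boundary and dispatch, so for very small inputs the kernel itself may be a small fraction of what you see.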

Best regards