Is it possible to call individual CUDA kernels directly in PyTorch?


I am currently benchmarking the performance of CUDA kernels that are widely used in certain scientific PyTorch models, and I want to know if it's possible to invoke these kernels directly from a PyTorch script.

For example, I have identified a handful of target kernels for benchmarking (say `aten::native_batch_norm` and `aten::conv2d`) and now want to benchmark each kernel across different input sizes, data sparsities, GPUs, etc.

Is there a way I can access these functions directly in a PyTorch script? If not, are there any established workflows for benchmarking these kernels without falling back to pure CUDA C++?

Thanks in advance!

If you want a way to call TorchScript ops from Python under their TorchScript name, you can use `torch.ops`, e.g. `torch.ops.aten.native_batch_norm`. Note that this still goes through the various "accounting" layers between Python and C++ as well as the PyTorch dispatcher. The dispatcher overhead can be reduced by using inference mode, but that is about it. Depending on your input sizes, this overhead on top of the kernels themselves may be large or small.
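As a minimal sketch of what this looks like in practice (the tensor shapes here are arbitrary, chosen only for illustration): call the op under its `torch.ops.aten` name inside `torch.inference_mode()`, and use `torch.utils.benchmark.Timer`, which handles warmup and CUDA synchronization for you.

```python
import torch
import torch.utils.benchmark as benchmark

# Arbitrary example shapes; swap in the sizes you want to benchmark.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(32, 64, 56, 56, device=device)
weight = torch.ones(64, device=device)
bias = torch.zeros(64, device=device)
running_mean = torch.zeros(64, device=device)
running_var = torch.ones(64, device=device)

# Call the op under its TorchScript name. In eval mode it returns
# (output, save_mean, save_invstd); the last two are unused here.
with torch.inference_mode():
    out, _, _ = torch.ops.aten.native_batch_norm(
        x, weight, bias, running_mean, running_var,
        False,   # training
        0.1,     # momentum
        1e-5,    # eps
    )

# Timer runs the statement repeatedly and reports per-call timing,
# synchronizing the GPU around measurements when CUDA tensors are used.
timer = benchmark.Timer(
    stmt="torch.ops.aten.native_batch_norm(x, w, b, rm, rv, False, 0.1, 1e-5)",
    globals={"torch": torch, "x": x, "w": weight, "b": bias,
             "rm": running_mean, "rv": running_var},
)
print(timer.timeit(100))
```

Keep in mind the measured time includes the Python-to-C++ boundary and dispatch, so for very small inputs the kernel itself may be a small fraction of what you see.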

Best regards