How does PyTorch call CUDA kernels, and how can I obtain their input parameters?


I would like to know whether there is a way to find out which CUDA library call or kernel is invoked by PyTorch. For example, for general matrix-matrix multiplication (GEMM), I am looking for an automated way to obtain the input matrix dimensions and sparsity as the high-level PyTorch API call is lowered to the low-level API and then translated into a library call. Where can I intercept the input information, and where is the call to the exact GEMM routine?

The call setup can be found in aten/src/ATen/native/cuda/Blas.cpp, and the actual cuBLAS calls are in aten/src/ATen/cuda/CUDABlas.cpp.

Thank you! I have a follow-up question. After reading the code, I did not find where sparse-dense matrix multiplication is handled. For example, I can see calls to the cuSPARSE library when using the NVIDIA profiling tools. Which file should I look into to find the details of how sparse matrix by dense matrix multiplication is handled?

The cuSPARSE calls should be defined in aten/src/ATen/native/sparse/cuda/ and aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cpp.

Thanks! I read through the ATen directory and have another follow-up question: how can I trace library invocations (e.g. GEMM) down to Intel MKL on the CPU, or cuBLAS/cuSPARSE on the GPU, and record the tensor sizes passed to those library calls? Reading the code base to figure this out seems inefficient. Is there a plug-in or tool that does this automatically? I tried the PyTorch profiler’s tracing, but it does not give the tensor sizes and does not seem to give the function names either.
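One thing that may be worth checking on the profiler side: shape recording is off by default, so a trace taken without `record_shapes=True` will not contain input sizes. Below is a minimal sketch, assuming a recent `torch.profiler` and a CPU-only run. The helper name `profile_matmul_shapes` and the matrix sizes are made up for illustration. Note that this reports shapes at the ATen operator level (e.g. `aten::mm`), not the arguments of the underlying MKL or cuBLAS call.

```python
import torch
from torch.profiler import profile, ProfilerActivity


def profile_matmul_shapes(m=64, k=32, n=16):
    """Hypothetical helper: profile one matmul and return (op name, input shapes)."""
    a = torch.randn(m, k)
    b = torch.randn(k, n)

    # record_shapes=True asks the profiler to store the input shapes of each operator.
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        torch.mm(a, b)

    # Keep only the ATen matmul events; each event exposes its name and input shapes.
    return [
        (evt.name, evt.input_shapes)
        for evt in prof.events()
        if evt.name == "aten::mm"
    ]


if __name__ == "__main__":
    for name, shapes in profile_matmul_shapes():
        print(name, shapes)
```

For a GPU run, `ProfilerActivity.CUDA` can be added to `activities` so that the launched kernels appear in the trace as well, and `prof.key_averages(group_by_input_shape=True)` groups the summary table by input shape.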

I also noticed that there is a dispatching mechanism that routes an operator to its CPU or GPU kernel, as described in the tutorial "Registering a Dispatched Operator in C++" (PyTorch Tutorials 2.0.0+cu117 documentation).
Would this be helpful for tracking kernel invocations and intercepting the input matrix sizes of GEMM calls?
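As a rough illustration of what dispatcher-level interception can look like from Python, here is a minimal sketch assuming `TorchDispatchMode` from `torch.utils._python_dispatch`. That module is private (note the leading underscore), so it may change between releases. The class name `GemmShapeLogger` and its list of matmul-like ops are hypothetical choices for this example. It sees ATen operators such as `aten.mm`, not the cuBLAS/cuSPARSE/MKL call itself, so it only shows which shapes reach the operator that later selects the library routine.

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode


class GemmShapeLogger(TorchDispatchMode):
    """Hypothetical helper: logs shape and zero fraction of inputs to matmul-like ATen ops."""

    # Assumed list of ops of interest; extend as needed.
    GEMM_OPS = ("aten.mm.", "aten.addmm.", "aten.bmm.")

    def __init__(self):
        super().__init__()
        self.records = []

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        name = str(func)  # e.g. "aten.mm.default"
        if name.startswith(self.GEMM_OPS):
            inputs = []
            for arg in args:
                if isinstance(arg, torch.Tensor):
                    sparsity = None
                    # Only compute the zero fraction for dense (strided) tensors.
                    if arg.layout == torch.strided and arg.numel() > 0:
                        nonzero = torch.count_nonzero(arg).item()
                        sparsity = 1.0 - nonzero / arg.numel()
                    inputs.append((tuple(arg.shape), sparsity))
            self.records.append((name, inputs))
        # Always run the real operator so results are unchanged.
        return func(*args, **kwargs)


if __name__ == "__main__":
    a = torch.zeros(4, 3)
    a[0, 0] = 1.0
    b = torch.ones(3, 5)
    with GemmShapeLogger() as logger:
        torch.mm(a, b)
    for record in logger.records:
        print(record)
```

Computing the zero fraction touches every element, and on a GPU the `.item()` call forces a synchronization, so this is better suited to one-off inspection than to timing runs.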

Thanks again!