Profiling computations at the operation level

I am trying to profile all of the addition and multiplication computations performed at the very bottom operation level (two input operands and the result, for example 2 + 3 = 5) during the inference of a model.

Is it possible to achieve this in Python (for example, with the PyTorch profiler), or do I need to look into the C++ backend? (I would also appreciate some hints in case I need to go deep into the C++ side.)

Thank you!

Based on my experience:

If you only want to measure the execution time of some operations, TensorBoard can help you, no matter whether your code runs on the CPU or the GPU.
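
For example, a minimal sketch using `torch.profiler` with its TensorBoard trace handler (the model, input shape, and log directory are just placeholders, and viewing the trace needs the `torch-tb-profiler` plugin and `tensorboard --logdir=./log`):

```python
import torch
import torchvision.models as models  # torchvision's resnet18 is only a stand-in model
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

model = models.resnet18().eval()
inputs = torch.randn(1, 3, 224, 224)

# Profile the CPU, and the GPU as well if one is available.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, inputs = model.cuda(), inputs.cuda()
    activities.append(ProfilerActivity.CUDA)

# Write a trace into ./log that TensorBoard can display.
with profile(
    activities=activities,
    on_trace_ready=tensorboard_trace_handler("./log"),
    record_shapes=True,
):
    with torch.no_grad():
        model(inputs)
```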

If you want to dig into the GPU kernel level for more detailed information, such as SM utilization or memory workload, you can use Nsight Compute to profile your program. It provides detailed information about GPU utilization.

Thank you for your suggestions! Actually, what I am trying to do is replace the multiplications and additions inside the convolutions with my own custom operations (like myadd and mymult) for some purposes, so I am looking for a way to do that… (I only need to replace them during the forward pass, though.)
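
To illustrate the direction I mean, here is a rough sketch that rewrites a 2D convolution with `F.unfold` (im2col) so the elementwise multiply and the accumulation become explicit tensor operations that could be swapped out; `mymult` and `myadd_reduce` are just placeholder names, and this is of course much slower than the fused convolution kernels:

```python
import torch
import torch.nn.functional as F

# Placeholder elementwise replacements -- this is where the custom
# behaviour would go; for now they just reproduce * and sum().
def mymult(a, b):
    return a * b

def myadd_reduce(x, dim):
    return x.sum(dim=dim)

def my_conv2d(x, weight, bias=None, stride=1, padding=0):
    """2D convolution written via unfold (im2col) so that the per-element
    multiply and the accumulation are explicit and swappable."""
    n, c_in, h, w = x.shape
    c_out, _, kh, kw = weight.shape
    # Columns of shape (N, C_in*kh*kw, L), L = number of output positions.
    cols = F.unfold(x, kernel_size=(kh, kw), stride=stride, padding=padding)
    w_flat = weight.view(c_out, -1)                       # (C_out, C_in*kh*kw)
    # Broadcast to (N, C_out, C_in*kh*kw, L), multiply, then reduce over dim 2.
    prod = mymult(cols.unsqueeze(1), w_flat.unsqueeze(0).unsqueeze(-1))
    out = myadd_reduce(prod, dim=2)                       # (N, C_out, L)
    if bias is not None:
        out = out + bias.view(1, -1, 1)
    h_out = (h + 2 * padding - kh) // stride + 1
    w_out = (w + 2 * padding - kw) // stride + 1
    return out.view(n, c_out, h_out, w_out)

# Sanity check against the built-in convolution.
x = torch.randn(2, 3, 8, 8)
w = torch.randn(4, 3, 3, 3)
b = torch.randn(4)
assert torch.allclose(my_conv2d(x, w, b, padding=1),
                      F.conv2d(x, w, b, padding=1), atol=1e-4)
```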

The PyTorch profiler does exactly that for PyTorch functions and has good analysis options.
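
For instance, something along these lines prints a per-operator summary; note that it reports PyTorch/ATen operators such as aten::conv2d, not the individual scalar additions and multiplications inside them (the model and input are placeholders):

```python
import torch
import torchvision.models as models  # placeholder model and input
from torch.profiler import profile, ProfilerActivity

model = models.resnet18().eval()
inputs = torch.randn(1, 3, 224, 224)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.no_grad():
        model(inputs)

# Per-operator summary, grouped by the input shapes each operator saw.
print(prof.key_averages(group_by_input_shape=True).table(
    sort_by="cpu_time_total", row_limit=15))
```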

Then you may get some help here.