Find actual GPU execution time

Suppose I have a model with only one layer, say an nn.Linear layer. I am interested in finding out the actual time of the operation itself, i.e. the matmul and bias addition, excluding the time spent moving data around, that is, calls such as cudaMalloc() or cudaMallocManaged().
What I plan on doing is finding out the CUDA/cuBLAS calls that nn.Linear makes and then instrumenting those calls to get the time.
I have this idea but don't know how to execute it.
I have an NLP model (which I want to instrument) inside a container, which has PyTorch built from the NVIDIA NGC PyTorch image. Any suggestions on how to do this?
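For reference, one way to measure just the device-side execution of the op (so that allocations done during warm-up are excluded) is with CUDA events. This is only a sketch under stated assumptions: the layer sizes and iteration counts are arbitrary, and it falls back to wall-clock timing on CPU-only machines so the snippet stays runnable anywhere.

```python
import time

import torch
import torch.nn as nn

def time_linear_forward(in_features=1024, out_features=1024, batch=64, iters=10):
    """Return the average forward time in ms for a single nn.Linear layer.

    On a GPU this uses CUDA events, which measure device-side execution
    only; lazy allocations (cudaMalloc etc.) happen during warm-up and
    are therefore excluded. On CPU it falls back to wall-clock timing.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    layer = nn.Linear(in_features, out_features).to(device)
    x = torch.randn(batch, in_features, device=device)

    # Warm-up so one-time allocations and kernel autotuning don't count.
    for _ in range(5):
        layer(x)

    if device == "cuda":
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            layer(x)  # matmul + bias addition
        end.record()
        torch.cuda.synchronize()  # wait for the timed kernels to finish
        return start.elapsed_time(end) / iters
    else:
        t0 = time.perf_counter()
        for _ in range(iters):
            layer(x)
        return (time.perf_counter() - t0) * 1000 / iters

print(f"avg forward time: {time_linear_forward():.3f} ms")
```

Note that CUDA-event timing still includes kernel launch overhead; isolating a single cuBLAS call more precisely is what a profiler like Nsight Systems is for.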

You could use nsys to profile your code and visualize specific parts of your model using Nsight Systems.
Since you are using the NGC container, nsys will already be installed.
Have a look at the NVTX docs to see how to set range markers.
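To illustrate the range markers, here is a minimal sketch using `torch.cuda.nvtx`, which ships with PyTorch. The range name `"linear_forward"` is just an illustrative label, and the NVTX calls are guarded because they require a CUDA build of PyTorch.

```python
import torch
import torch.nn as nn

# NVTX calls need a CUDA build of PyTorch, so guard on availability.
use_nvtx = torch.cuda.is_available()

device = "cuda" if use_nvtx else "cpu"
model = nn.Linear(1024, 1024).to(device)
x = torch.randn(64, 1024, device=device)

if use_nvtx:
    torch.cuda.nvtx.range_push("linear_forward")
y = model(x)  # this forward pass appears under the named range in nsys
if use_nvtx:
    torch.cuda.nvtx.range_pop()
print(y.shape)
```

Running the script under `nsys profile python script.py` then shows `linear_forward` as a named span on the Nsight Systems timeline, with the CUDA kernels it launched nested beneath it.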

Ok, thanks for the reply. I was able to get the overall statistics. Now suppose that I have many layers; is there any way to get profiling results at the granularity of individual layers?
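One common way to get per-layer ranges (a sketch, not from this thread; the helper name `add_nvtx_hooks` is hypothetical) is to attach forward hooks that push and pop an NVTX range around each leaf module, so every layer shows up as its own named span in Nsight Systems:

```python
import torch
import torch.nn as nn

# NVTX needs a CUDA build of PyTorch, hence the guard.
use_nvtx = torch.cuda.is_available()

def add_nvtx_hooks(model):
    """Wrap each leaf module's forward pass in an NVTX range named after it."""
    def pre_hook(name):
        def hook(module, inputs):
            if use_nvtx:
                torch.cuda.nvtx.range_push(name)
        return hook

    def post_hook(module, inputs, output):
        if use_nvtx:
            torch.cuda.nvtx.range_pop()

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # annotate leaf layers only
            module.register_forward_pre_hook(pre_hook(name))
            module.register_forward_hook(post_hook)
    return model

model = add_nvtx_hooks(
    nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)))
y = model(torch.randn(4, 128))
print(y.shape)
```

Alternatively, PyTorch's `torch.autograd.profiler.emit_nvtx()` context manager annotates every operator call with NVTX ranges automatically, at the cost of a much noisier timeline than hand-placed per-layer markers.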