How to trace kernels in PyTorch

I want to trace all the kernels launched during DL inference in PyTorch and store the kernel information locally for other purposes. Unfortunately, I cannot find where to get this in the source code. In other words, where does PyTorch launch kernels during DL inference? Could you give me some suggestions?
Thanks so much!
God bless you

I’m not sure which “kernel information” you would like to store, but would profiling the workload, as described here, work?
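If it helps, here is a minimal torch.profiler sketch of what I mean (assuming a CUDA build, and using torchvision’s vgg11 to match your model); it records every launched kernel and writes the full timeline to a local file:

```python
import torch
import torchvision
from torch.profiler import profile, ProfilerActivity

model = torchvision.models.vgg11().cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")

# Warmup, so one-time setup (cuDNN heuristics etc.) doesn't land in the trace
with torch.no_grad():
    model(x)
torch.cuda.synchronize()

with torch.no_grad(), profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]
) as prof:
    model(x)

# Kernel names, launch counts, and durations aggregated per op
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

# Full per-event timeline (one event per kernel) for offline processing
prof.export_chrome_trace("trace.json")
```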

There is no single place where the kernels are launched; you would need to check which operations are used. E.g., a lot of kernels are launched from files under aten/src/ATen/. I also don’t fully understand your use case and why you are looking for the locations of all kernel launches.

When I profiled vgg11 inference with batch size 1 using Nsight Systems, I found that 45 kernels were executed in total. I want to trace all these kernels and re-organize them into a graph or stream, for academic purposes among other things. How can I get that?

I still don’t know what exactly you want to read out in the end, but since you are already using Nsight Systems, you might want to add the --trace option and process the profile it creates, as it would show the call stack for each kernel.
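For the op-to-kernel mapping specifically, torch.autograd.profiler.emit_nvtx() wraps every op in an NVTX range, so the Nsight Systems timeline shows which op launched each kernel when you profile with something like `nsys profile -t cuda,nvtx -o vgg11 python script.py` (a sketch, reusing the vgg11 setup from above):

```python
import torch
import torchvision

model = torchvision.models.vgg11().cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")

with torch.no_grad():
    model(x)  # warmup outside the annotated region
torch.cuda.synchronize()

# Every op executed in this context opens an NVTX range, nesting its
# kernel launches under the op name in the Nsight Systems timeline.
with torch.no_grad(), torch.autograd.profiler.emit_nvtx():
    model(x)
torch.cuda.synchronize()
```

And if a pure-PyTorch path is acceptable for the “re-organize into a graph or stream” part, you could post-process the Chrome trace exported above. A rough sketch; the "cat" and "args" fields below are what recent Kineto versions emit, which is an assumption here, so inspect a few raw events from your own trace.json first:

```python
import json
from collections import defaultdict

with open("trace.json") as f:
    events = json.load(f)["traceEvents"]

# Group GPU kernel events by the CUDA stream they ran on
kernels_by_stream = defaultdict(list)
for ev in events:
    if ev.get("cat", "").lower() == "kernel":  # assumed Kineto category name
        stream = ev.get("args", {}).get("stream")
        kernels_by_stream[stream].append((ev["ts"], ev["dur"], ev["name"]))

for stream, kernels in kernels_by_stream.items():
    kernels.sort()  # chronological order within the stream
    print(f"stream {stream}: {len(kernels)} kernels")
    for ts, dur, name in kernels:
        print(f"  t={ts} dur={dur}us {name}")
```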

Actually, I want to implement CUDA Graphs manually at the C++ level rather than depending on PyTorch.

I’m unsure why you are not using the built-in CUDA Graphs utilities, as doing it manually sounds quite challenging, but I’m sure you have valid reasons for it. Good luck, and let me know how it goes!
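For comparison, capturing and replaying with the built-in utilities is only a few lines. A minimal sketch using torch.cuda.CUDAGraph / torch.cuda.graph (assuming torchvision’s vgg11 and the side-stream warmup pattern the docs recommend), in case it spares you from reimplementing stream capture yourself:

```python
import torch
import torchvision

model = torchvision.models.vgg11().cuda().eval()
static_input = torch.randn(1, 3, 224, 224, device="cuda")

# Warmup on a side stream before capture, as the docs recommend
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture: the kernels of one forward pass are recorded into a single graph
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_output = model(static_input)

# Replay with new data: refill the static input buffer, then replay
static_input.copy_(torch.randn(1, 3, 224, 224, device="cuda"))
g.replay()
torch.cuda.synchronize()
print(static_output.flatten()[:5])
```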

I do not know how to collect the kernels using the PyTorch API…

I am reading the PyTorch source code related to CUDA Graphs. I am confused about why the torch devs put a CUDA generator in the CUDAGraph struct in CUDAGraph.h. How can I use it?

You can use CUDA Graphs in PyTorch as described here. I’m not sure what exactly is confusing about the design.
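As for the generator: my understanding is that CUDAGraph holds onto the CUDA RNG generator because RNG-consuming ops (dropout etc.) get baked into the capture; the captured kernels read their Philox seed/offset from fixed device memory, and PyTorch advances that offset before each replay so replays still see fresh random numbers. You normally never touch it directly; it works through the regular capture/replay API. A small sketch showing the effect (assuming a CUDA build):

```python
import torch

model = torch.nn.Dropout(p=0.5).cuda()  # an RNG-consuming op
static_x = torch.ones(8, device="cuda")

# Side-stream warmup before capture, per the docs
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_x)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y = model(static_x)

g.replay()
print(static_y)  # one dropout mask
g.replay()
print(static_y)  # should differ: the Philox offset advances before each replay
```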