PyTorch profiling on a multi-GPU system


My system has 8 RTX 2080 Ti GPUs, which are Turing architecture, so I have to use ncu instead of nvprof.

When I run PyTorch under ncu with metrics on a single GPU, it profiles exactly the kernels I want. But when I run on multiple GPUs, where ncclAllReduce may be called, it cannot profile and stops before the PyTorch ImageNet training starts.

Can I ask why it cannot profile ImageNet on multiple GPUs, or could you recommend another profiler?

I want to know how to profile this with ncu or nvprof.

The figure below is a screenshot of the ImageNet run stopped in the multi-GPU case.


You could use Nsight Systems as described here.
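As a rough sketch of what such a run could look like, the snippet below assembles an `nsys profile` command line in Python. The script name `main.py`, its arguments, the dataset path, and the output name are hypothetical placeholders, and the selected trace categories are only one reasonable choice, not a required set. NCCL launches its collectives as CUDA kernels, so they would be expected to show up in the CUDA trace alongside the training kernels.

```python
import shlex


def build_nsys_command(script, script_args=(), output="multi_gpu_profile"):
    """Return an argv list that runs `script` under `nsys profile`.

    `script`, `script_args`, and `output` are placeholders for illustration.
    """
    cmd = [
        "nsys", "profile",
        # Trace CUDA API calls and kernels, NVTX ranges, OS runtime calls,
        # cuDNN, and cuBLAS.
        "--trace=cuda,nvtx,osrt,cudnn,cublas",
        # Base name of the report file that nsys writes.
        "--output=" + output,
        # Overwrite a report left over from an earlier run.
        "--force-overwrite=true",
        "python", script,
    ]
    cmd.extend(script_args)
    return cmd


if __name__ == "__main__":
    # Hypothetical training script and arguments.
    cmd = build_nsys_command("main.py", ["-a", "resnet18", "/path/to/imagenet"])
    # Print the command so it can be inspected or pasted into a shell.
    print(shlex.join(cmd))
```

The printed command can be run directly in a shell, or the list can be passed to `subprocess.run`. The resulting report would then be opened in the Nsight Systems GUI to look at the per-GPU timelines.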

Oh, thanks @ptrblck.

Hmm, if I use Nsight Systems, does it show the cache hit rate on multiple GPUs?

Also, I hope collective communication can be profiled too.