How the profiler runs with multiple CUDA cards

Hi, I am using the profiler to investigate a network bottleneck, and I noticed a weird phenomenon: when I run a short CUDA script while watching `watch -n 1 nvidia-smi` in another terminal, I see four processes appear, one on each of the four cards.
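For reference, a minimal script along these lines reproduces it (the sizes and the matmul are arbitrary, just to launch some kernels):

```python
import torch
from torch.autograd import profiler

x = torch.randn(1024, 1024, device="cuda:0")

# While this runs, `watch -n 1 nvidia-smi` in another terminal
# shows a process on every visible GPU, not only on cuda:0.
with profiler.profile(use_cuda=True) as prof:
    for _ in range(10):
        y = x @ x

print(prof.key_averages().table(sort_by="cuda_time_total"))
```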
Looking for the reason in the source code, I found profiler_cuda.cpp, where the `onEachDevice` interface warms up all CUDA devices. In the `record` interface it uses `cudaGetDevice`, which returns the current device for the calling host thread.
What confuses me: if the profiler uses an underlying thread on a single card, why do four processes run on four cards?
And how does the profiler deal with the four results? Does it use the host thread's result?
Also, how should the profiler be used with multiple cards? In which cases is multi-card profiling recommended?
Is there more detailed information I can look at, or that I should provide?
@ptrblck @albanD
Thank you very much!
Have a good day!

I don’t quite understand the issue.
Are you manually calling `onEachDevice` or how are you using the profiler?
Also, are you using the C++ API or are you running a Python script? :slight_smile:

I am running a Python script, and I traced from the Python code down into the C++ code. In the C++ code I found that `onEachDevice` is called at the beginning of the profiler's start.
The phenomenon I saw: when I ran the Python code, I opened another terminal and found that all four cards were running processes, even though I hadn't set up multiple cards or launched multiple processes. So I was confused.
Later, I put several kernels on different cards with `torch.cuda.set_device(0/1/2/3)` and `y = y.to("cuda:0/1/2/3")`, and found the profiler running on several cards once tensors were placed on different cards, with the result handled at the end by the host thread.
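This is roughly what I tried (assuming four visible GPUs; the op itself is arbitrary):

```python
import torch
from torch.autograd import profiler

with profiler.profile(use_cuda=True) as prof:
    for i in range(torch.cuda.device_count()):
        torch.cuda.set_device(i)
        # launch a small kernel on each card
        y = torch.randn(512, 512).to(f"cuda:{i}")
        y = y * 2

# events recorded on all devices come back in one merged result
print(prof.key_averages().table(sort_by="cuda_time_total"))
```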
To summarize: profiling with multiple CUDA cards warms up all the cards, and if kernels in the Python code are placed on different cards, each card runs a new process for its kernels. Information such as the tid is collected in the Event, returned to the profiler on the Python side, and shown to the user.
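If it helps to see where the tid shows up: on the Python side the per-event metadata can be inspected, e.g. (attribute names assume the legacy autograd profiler's `FunctionEvent`):

```python
# Continuing from the snippet above: each recorded event
# carries the id of the host thread it was recorded on.
for evt in prof.function_events:
    print(evt.name, evt.thread)
```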
Is anything wrong with this understanding?
Thank you a lot!