Interpreting profiler results

When profiling a network, I see that the non-zero operation apparently takes a lot of time, around 20 ms.

Digging deeper into this, I called torch.cuda.synchronize() before profiling that operator and saw that the non-zero operator itself takes only around 1 ms.

I think some other kernel might still be running beforehand, so the long wait/synchronize gets attributed to the non-zero operator. However, I am not sure how to proceed from there, since I have no information about which operators are running. Is there a detailed tutorial on the profiler, or can someone help out here?

The nonzero operation will synchronize the code and will thus accumulate the execution time of running (async) CUDA operations. If you manually synchronize the code the actual time will be displayed.
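To illustrate the point about manual synchronization, here is a minimal sketch of timing nonzero on its own. The helper name `timed` and the tensor sizes are made up for illustration, and the code falls back to the CPU (where no synchronization is needed) if no GPU is available.

```python
import time
import torch

def timed(fn, device):
    """Run fn and return (result, seconds), synchronizing around it on CUDA."""
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for previously queued kernels first
    start = time.perf_counter()
    out = fn()
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for fn's own kernels to finish
    return out, time.perf_counter() - start

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(512, 512, device=device)
y = x @ x  # on CUDA this is only queued; it may still be running afterwards

# Because timed() synchronizes first, the matmul above is not counted here.
idx, seconds = timed(lambda: torch.nonzero(y > 0), device)
print(f"nonzero alone: {seconds * 1e3:.3f} ms, {idx.shape[0]} indices")
```

Without the first synchronize, the measured time on a GPU would also include whatever is left of the matmul.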

Thanks @ptrblck, that is my understanding. What I am trying to grasp is which operators are running in the background that cause the syncing to take this much time, independent of which operator triggers it. I am guessing there are some operators from the layers before, mainly convolutions, which still need more time to finish, e.g. volta_sgemm_128x32_tn? It’s not clear to me which ones these are, so I can't tell whether I can optimize something in the layers before.
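One way to see which operators and kernels account for the time is torch.profiler, which can record GPU kernels alongside CPU-side operators. The sketch below is not the original poster's network: the two-convolution model and the input shape are placeholders, and it profiles only the CPU when no GPU is present.

```python
import torch
from torch.profiler import profile, ProfilerActivity

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
activities = [ProfilerActivity.CPU]
if device.type == "cuda":
    activities.append(ProfilerActivity.CUDA)  # also record GPU kernels

# Placeholder model standing in for the layers that run before nonzero.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(16, 16, 3, padding=1),
).to(device)
x = torch.randn(4, 3, 64, 64, device=device)

with torch.no_grad(), profile(activities=activities) as prof:
    out = model(x)
    idx = torch.nonzero(out > 0)  # forces a sync on CUDA

# Sort by device time on a GPU so the convolution kernels show up at the top.
sort_key = "cuda_time_total" if device.type == "cuda" else "cpu_time_total"
table = prof.key_averages().table(sort_by=sort_key, row_limit=10)
print(table)
```

On a GPU, the rows with the largest device time are the kernels the sync is actually waiting for, rather than nonzero itself.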

E.g. I ran the network with the intermediate layers that use the non-zero function removed, which led to a profiled speedup of roughly the syncing time as well, so I assume there is some potential there.

Yes, the previous operations will be synchronized and will be shown in the profile. I don’t fully understand what exactly you are trying to speed up though. In the optimal case, don’t call into nonzero as there is currently no way around the sync.
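Whether nonzero can be avoided depends on what the downstream code needs. As a sketch under the assumption that only a reduction over the selected elements is needed (the values below are made up), a fixed-shape mask with torch.where gives the same result without producing a tensor whose size depends on the data:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
scores = torch.tensor([0.2, -1.0, 0.7, -0.3], device=device)
mask = scores > 0

# Data-dependent shape: the number of selected elements must be known
# on the host before the output can be allocated.
selected = scores[torch.nonzero(mask).squeeze(1)]
total_indexed = selected.sum()

# Fixed-shape alternative: the output has the same shape as the input,
# and unselected positions are replaced with zero.
masked = torch.where(mask, scores, torch.zeros_like(scores))
total_fixed = masked.sum()

print(total_indexed.item(), total_fixed.item())
```

This only helps when the later steps can work on a masked tensor of the original shape; if the explicit indices are required, the sync remains.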

It seems like there might be a kernel execution delay before the non-zero operation. Profiling with more granularity is essential. You can check PyTorch’s official documentation or seek guidance on forums like Stack Overflow for detailed profiling and troubleshooting tips regarding your specific setup.

Could you explain what “kernel execution delay” means in this context and how it relates to the issue?