I’m profiling a large network with torch.autograd.profiler.profile(use_cuda=True). As far as I can tell, everything runs on the GPU, but for quite a few operations the profiler still reports much more time spent on the CPU than on the GPU. Some examples are:
I suspect that some of this is due to the lack of synchronization between the CPU and GPU (when I call torch.cuda.synchronize(), some of the reported values change).
I’ve tried other profilers (nvprof+nvvp) and get similar results.
My question: If this issue is due to the lack of synchronization, how do I get more accurate execution times?
If it is not due to synchronization and the values are correct, are there tricks to reduce the amount of CPU time? I expected most of these functions to be highly parallelized and benefit greatly from GPU kernels…
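For reference, a minimal sketch of the profiling setup described above (the tensor sizes and the toy forward pass are my own; the original network is much larger). It falls back to CPU profiling when no GPU is present:

```python
import torch

# Use the GPU when available, otherwise fall back to CPU so the sketch still runs.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(512, 512, device=device)

# use_cuda=True makes the profiler record GPU kernel times next to CPU times.
with torch.autograd.profiler.profile(use_cuda=torch.cuda.is_available()) as prof:
    y = (x @ x).relu()

# Sort by total CPU time to surface the ops where CPU time dominates.
print(prof.key_averages().table(sort_by="cpu_time_total"))
```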
It seems the high CPU times were due to the asynchronous execution of CPU and GPU instructions (see also this reply).
To get better values, I ran:
CUDA_LAUNCH_BLOCKING=1 python3 profileNetwork.py
Now the CPU times for the functions reported above are almost the same as their GPU times.
If I understand correctly, the CUDA_LAUNCH_BLOCKING flag ensures that when a CPU instruction is waiting for a result from the GPU, the waiting time is no longer accumulated into the reported CPU time.
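The distortion can be seen with a small wall-clock timing sketch (the matrix size and variable names are my own, not from the profiler output above). On a GPU, the matmul launch returns almost immediately, and the waiting cost only appears at the first synchronizing call; on CPU the two timings are close:

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(2048, 2048, device=device)

t0 = time.time()
y = x @ x              # on CUDA this launch is asynchronous and returns early
t_launch = time.time() - t0

z = y.sum().item()     # .item() forces a CPU<->GPU sync; the wait lands here
t_total = time.time() - t0

print(f"after launch: {t_launch:.6f}s, after sync: {t_total:.6f}s")
```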
“the CUDA_LAUNCH_BLOCKING flag ensures that when a CPU instruction is waiting for a result from the GPU, the waiting time is no longer accumulated into the reported CPU time.” ← Where does this claim come from? I can’t find support for it anywhere in the forums or discussions.
I cannot find any such thing on that webpage. Can you please paste a screenshot, or copy the exact sentence from that page that makes this claim?
Ah, what it says is a bit different:
With CUDA_LAUNCH_BLOCKING=1, each CUDA call will wait for the kernel execution to finish before returning.
Meaning that no later CPU instruction will have to wait for a result anymore.
And so this timing distortion won’t happen.
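As an aside, a minimal sketch of an alternative that avoids CUDA_LAUNCH_BLOCKING entirely: CUDA events record timestamps on the GPU stream itself, so the measured interval reflects kernel time regardless of when the CPU happens to wait. (The matrix size here is arbitrary; the sketch only runs its timing when a GPU is present.)

```python
import torch

if torch.cuda.is_available():
    x = torch.randn(2048, 2048, device="cuda")

    # Events with enable_timing=True record timestamps on the CUDA stream.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    y = x @ x
    end.record()

    # elapsed_time is only valid once both events have completed.
    torch.cuda.synchronize()
    print(f"kernel time: {start.elapsed_time(end):.3f} ms")
```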