CPU vs GPU timing of CUDA operations

Hi everyone,

I’m profiling a large network with torch.autograd.profiler.profile(use_cuda=True). As far as I can tell, everything runs on the GPU, but for quite a few operations the profiler still reports much more time spent on the CPU than on the GPU. Some examples:

N5torch8autograd13CopyBackwardsE:
     CPU: 87555.573us      GPU: 1532.959us
mse_loss:
     CPU: 1767.147us      GPU: 93.445us
avg_pool3d_backward:
     CPU: 2655.118us      GPU: 183.594us
add:
     CPU: 10984.388us      GPU: 131.836us
CatBackward:
     CPU: 24874.143us      GPU: 13.672us

I suspect some of this is due to the lack of synchronization between CPU and GPU (when I call torch.cuda.synchronize(), some of the values change).
I’ve tried other profilers (nvprof+nvvp) and get similar results.
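
To double-check individual ops I’ve also been comparing plain wall-clock time against CUDA events, roughly like this (the pooling op is just a stand-in, not taken from my network):

import time
import torch

x = torch.randn(64, 3, 16, 64, 64, device="cuda")
pool = torch.nn.AvgPool3d(2)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

t0 = time.time()
start.record()
y = pool(x)
end.record()
cpu_side_ms = (time.time() - t0) * 1e3  # pool() returns as soon as the kernel is queued

torch.cuda.synchronize()  # the events must have completed before elapsed_time()
print(f"CPU-side wall clock: {cpu_side_ms:.3f} ms")
print(f"GPU event time:      {start.elapsed_time(end):.3f} ms")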

My question: If this issue is due to the lack of synchronization, how do I get more accurate execution times?
If it is not due to synchronization and the values are correct, are there tricks to reduce the amount of CPU time? I expected most of these functions to be highly parallelized and benefit greatly from GPU kernels…


It seems the high CPU times were due to the asynchronous execution of CPU and GPU instructions (see also this reply).

To get better values, I ran:

CUDA_LAUNCH_BLOCKING=1 python3 profileNetwork.py

Now the CPU times for the functions reported above are almost the same as their GPU times.
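
The same thing can also be done from inside the script by setting the environment variable before any CUDA work happens (a sketch; the safest place is before importing torch at all):

import os

# CUDA_LAUNCH_BLOCKING is read when the CUDA runtime initializes,
# so it must be set before the first CUDA call in the process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

x = torch.randn(1024, 1024, device="cuda")
y = x @ x  # with the flag set, this call only returns once the kernel has finished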

If I understand correctly, the CUDA_LAUNCH_BLOCKING flag ensures that when a CPU instruction is waiting for a result from the GPU, the waiting time is no longer accumulated into the reported CPU time.

“the CUDA_LAUNCH_BLOCKING flag ensures that when a CPU instruction is waiting for a result from the GPU, the waiting time is no longer accumulated into the reported CPU time.” ← Where did you find this claim? I can’t find any support for it anywhere in the forums or discussions.

You can find info about this in the CUDA docs: Programming Guide :: CUDA Toolkit Documentation


I cannot find any such thing on that webpage. Can you please paste a screenshot or copy the exact sentence from that page that claims this → “the CUDA_LAUNCH_BLOCKING flag ensures that when a CPU instruction is waiting for a result from the GPU, the waiting time is no longer accumulated into the reported CPU time.”

Oh, what it says is a bit different:
With CUDA_LAUNCH_BLOCKING=1, each CUDA call waits for the kernel execution to finish before returning.
That means no CPU instruction has to wait for a result from the GPU anymore, and so this distortion won’t happen.
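
To make the distortion concrete, here is a small standalone example (not related to the network above): without blocking launches, the wait shows up on whatever later CPU instruction happens to need the result.

import time
import torch

a = torch.randn(8192, 8192, device="cuda")
torch.cuda.synchronize()  # make sure setup work is done before timing

# The matmul is launched asynchronously: this line returns almost immediately.
t0 = time.time()
b = a @ a
print(f"matmul call: {(time.time() - t0) * 1e3:.3f} ms")

# .item() needs the result on the CPU, so it blocks until the matmul has
# actually finished -- and gets 'charged' for that wait in CPU-side timings.
t0 = time.time()
val = b[0, 0].item()
print(f"item() call: {(time.time() - t0) * 1e3:.3f} ms")

# With CUDA_LAUNCH_BLOCKING=1 the matmul line itself would block instead,
# so the time would be attributed to the right operation.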
