PyTorch Profiler CPU and GPU time

Hello, I am trying to compare execution times of two different models and I have the following outputs:

MODEL 1
---------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    CPU Mem          Self CPU Mem     CUDA Mem         Self CUDA Mem    Number of Calls
---------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
model_inference                    3.78%            10.698ms         100.00%          283.100ms        283.100ms        21.06%           283.099ms        283.099ms        0 b              0 b              1.59 Mb          -5.07 Gb         1
convolution                        0.22%            636.759us        83.34%           235.940ms        4.369ms          16.89%           227.029ms        4.204ms          0 b              0 b              2.49 Gb          0 b              54
_convolution                       2.04%            5.770ms          83.12%           235.304ms        4.357ms          16.87%           226.761ms        4.199ms          0 b              0 b              2.49 Gb          0 b              54
conv2d                             0.23%            657.154us        79.73%           225.702ms        4.340ms          16.07%           215.954ms        4.153ms          0 b              0 b              2.49 Gb          0 b              52
cudnn_convolution                  77.90%           220.547ms        79.52%           225.134ms        6.085ms          15.74%           211.566ms        5.718ms          0 b              0 b              1.63 Gb          -2.74 Mb         37
batch_norm                         0.45%            1.263ms          5.16%            14.609ms         280.941us        2.01%            26.963ms         518.524us        0 b              0 b              2.49 Gb          -158.00 Kb       52
_batch_norm_impl_index             0.83%            2.362ms          4.71%            13.346ms         256.651us        1.97%            26.514ms         509.885us        0 b              0 b              2.49 Gb          0 b              52
lstm                               0.04%            119.700us        4.14%            11.712ms         11.712ms         0.88%            11.866ms         11.866ms         0 b              -8 b             2.08 Mb          0 b              1
conv1d                             0.01%            25.260us         3.86%            10.920ms         5.460ms          0.84%            11.349ms         5.674ms          0 b              0 b              1012.00 Kb       0 b              2
_cudnn_init_dropout_state          3.39%            9.584ms          3.39%            9.595ms          9.595ms          0.71%            9.595ms          9.595ms          0 b              0 b              1.59 Mb          0 b              1
---------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 283.100ms
CUDA time total: 1.344s



MODEL 2
---------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    CPU Mem          Self CPU Mem     CUDA Mem         Self CUDA Mem    Number of Calls
---------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
model_inference                    4.72%            9.301ms          100.00%          197.095ms        197.095ms        21.51%           197.095ms        197.095ms        0 b              0 b              1.59 Mb          -3.10 Gb         1
convolution                        0.28%            557.852us        77.57%           152.878ms        2.831ms          16.30%           149.410ms        2.767ms          0 b              0 b              1.51 Gb          0 b              54
_convolution                       1.47%            2.896ms          77.28%           152.320ms        2.821ms          16.28%           149.156ms        2.762ms          0 b              0 b              1.51 Gb          0 b              54
cudnn_convolution                  70.81%           139.564ms        72.86%           143.598ms        3.881ms          15.21%           139.411ms        3.768ms          0 b              0 b              1.01 Gb          -154.50 Kb       37
conv2d                             0.29%            573.546us        71.73%           141.381ms        2.719ms          15.03%           137.764ms        2.649ms          0 b              0 b              1.51 Gb          0 b              52
batch_norm                         0.56%            1.110ms          6.39%            12.598ms         242.262us        2.13%            19.510ms         375.183us        0 b              0 b              1.51 Gb          -102.00 Kb       52
conv1d                             0.02%            30.088us         6.14%            12.100ms         6.050ms          1.30%            11.905ms         5.952ms          0 b              0 b              1012.00 Kb       0 b              2
_batch_norm_impl_index             1.10%            2.177ms          5.83%            11.488ms         220.918us        2.08%            19.046ms         366.269us        0 b              0 b              1.51 Gb          0 b              52
lstm                               0.05%            108.296us        5.35%            10.541ms         10.541ms         1.15%            10.540ms         10.540ms         0 b              -8 b             2.08 Mb          0 b              1
empty                              4.18%            8.245ms          4.18%            8.245ms          19.867us         0.71%            6.509ms          15.685us         8 b              8 b              2.65 Gb          2.65 Gb          415
---------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 197.095ms
CUDA time total: 916.409ms

Can anyone explain the huge differences between CUDA total time and CPU total time?


Hi,

I think the CPU total is the amount of time the CPU is actively doing stuff, and the CUDA time is the amount of time the GPU is actively doing stuff.
So in your case, the CPU doesn’t have much to do and the GPU is doing all the heavy lifting (the CPU just waits for the GPU to finish its work).
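For reference, tables like the ones above can be produced with the profiler; a minimal sketch (the tiny `Conv2d` model and input shapes are just placeholders, and the CUDA activity is only added when a GPU is available):

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Hypothetical small model and input, just to have something to profile.
model = torch.nn.Conv2d(3, 8, kernel_size=3)
x = torch.randn(1, 3, 32, 32)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    model, x = model.cuda(), x.cuda()

with profile(activities=activities) as prof:
    # record_function labels this region, producing the
    # "model_inference" row seen in the tables above
    with record_function("model_inference"):
        model(x)

# Prints a table with Self CPU, CPU total, CUDA total, etc.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

On a CPU-only machine the CUDA columns are simply absent; on a GPU machine the same run reports both, which is what makes the comparison in the question possible.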

And what about the memory needed for inference? Is there a way to print it (like the CUDA time, for example), or to calculate it based on the above results?


Memory profiling was added very recently. You can use the nightly builds and check the latest docs (https://pytorch.org/docs/master/autograd.html#profiler) for details on how to use it.
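A minimal sketch of what that looks like with the autograd profiler (the `Linear` model and shapes here are placeholders; `profile_memory=True` is the flag that adds the memory columns):

```python
import torch
from torch.autograd import profiler

# Hypothetical model and input, just to generate some allocations.
model = torch.nn.Linear(128, 64)
x = torch.randn(32, 128)

# profile_memory=True records tensor allocations, adding the
# "CPU Mem" / "Self CPU Mem" (and, on GPU, "CUDA Mem") columns.
with profiler.profile(profile_memory=True) as prof:
    model(x)

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=5))
```

Note that this reports per-operator allocations during the profiled region, not a single "total memory needed for inference" number, so you still have to interpret the table.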


Can execution of CPU and GPU overlap for a single kernel?
I am trying to calculate the total amount of time it takes for a kernel to finish.
From observations, it seems that the total time for a kernel cannot be calculated simply as “cpu_time + cuda_time”.
Any suggestions? Thanks.

Yes, CUDA operations are executed asynchronously, so to profile them you would need to synchronize the code before starting and stopping the timers via torch.cuda.synchronize().
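A minimal timing sketch along those lines (the helper name `time_op`, the warmup/iteration counts, and the `matmul` workload are all illustrative choices; the synchronize calls are no-ops on a CPU-only machine):

```python
import time
import torch

def time_op(fn, *args, warmup=3, iters=10):
    # Warm up first to exclude one-time costs (lazy init, cudnn autotuning).
    for _ in range(warmup):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for all pending kernels before starting the timer
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the kernels launched in the loop to finish
    return (time.perf_counter() - start) / iters

x = torch.randn(256, 256)
print(f"matmul: {time_op(torch.matmul, x, x) * 1e3:.3f} ms per call")
```

Without the second synchronize, the timer would stop as soon as the CPU finished *launching* the kernels, which is exactly the "CPU runs ahead" problem described below.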

I don’t understand this claim. If you are not synchronizing, the CPU would run ahead and just stop the timer directly while the GPU is still busy.