Hello,
I am using `torch.utils.bottleneck` to profile some code that runs on both CPU and CUDA. I have tried to find documentation on how to interpret its output, but have not found anything relevant. Could anybody please help me with the following?

- What are the time units for the cProfile results? I assume it is milliseconds, since I read somewhere that `bottleneck` uses the event API.
- What is the difference between the `Self CPU total` and `CPU total` columns?
- Below is the output of `torch.utils.bottleneck` for my script. It seems the call to `to` takes most of the CPU time. Is this right?
- The `percall` columns in the cProfile output show a value of `0.000`. Does this mean the value is smaller than 1 microsecond (or maybe smaller than 1 ms) and could not be printed, or was there an error and the value could not be displayed?
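For reference, here is the arithmetic I tried for the `to` row, assuming `percall` is simply `tottime / ncalls` (which is how I understand cProfile's documentation):

```python
# cProfile prints percall = tottime / ncalls with three decimal
# places, so a sub-millisecond average shows up as 0.000.
tottime = 2.662   # seconds, from the `to` row in the output below
ncalls = 18009

percall = tottime / ncalls
print(f"{percall:.3f}")                  # -> 0.000 (what cProfile prints)
print(f"{percall * 1e6:.1f} us per call")  # -> 147.8 us per call (actual average)
```

So if that assumption is right, the value is not an error, just rounded away by the three-decimal formatting.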
I would appreciate it if anybody could point me to a document explaining the contents of this output in more detail. Thanks!
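In case it helps, here is a minimal standalone snippet (not my actual script) showing how I understand the autograd-profiler half of `bottleneck` can be reproduced directly; the table it prints labels each entry with explicit units (us/ms/s), which is what made me unsure about the cProfile units:

```python
import torch
from torch.autograd import profiler

# Profile a few CPU tensor ops, roughly mirroring what
# torch.utils.bottleneck records for its autograd section.
x = torch.randn(256, 256)
with profiler.profile() as prof:
    for _ in range(10):
        y = x.mm(x)
        s = y.trace().item()

# Same sort key as bottleneck's "top 15 events sorted by cpu_time_total".
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=15)
print(table)
```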
--------------------------------------------------------------------------------
Environment Summary
--------------------------------------------------------------------------------
PyTorch 1.2.0 compiled w/ CUDA 10.0.130
Running with Python 2.7 and CUDA 10.0.130
`pip list` truncated output:
numpy==1.15.1
torch==0.4.0
torchvision==0.2.1
--------------------------------------------------------------------------------
cProfile output
--------------------------------------------------------------------------------
832891 function calls (825332 primitive calls) in 4.975 seconds
Ordered by: internal time
List reduced from 3502 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
18009 2.662 0.000 2.662 0.000 {method 'to' of 'torch._C._TensorBase' objects}
1000 0.164 0.000 1.858 0.002 geometric_controller_gpu.py:103(position_controller)
1000 0.136 0.000 0.330 0.000 geometric_controller_gpu.py:222(desired_body_rate)
6008 0.126 0.000 0.126 0.000 {tensor}
6000 0.089 0.000 0.089 0.000 {addcmul}
8000 0.089 0.000 0.089 0.000 {cross}
3000 0.088 0.000 0.088 0.000 {dot}
7000 0.068 0.000 0.068 0.000 {add}
2000 0.066 0.000 0.186 0.000 /home/fer/git_clone/riseq_ws/src/riseq_common/src/riseq_common/torch_dyn_utils.py:117(rotation_matrix_distance)
3000 0.065 0.000 0.065 0.000 {cat}
5000 0.065 0.000 0.065 0.000 {frobenius_norm}
2000 0.061 0.000 0.061 0.000 {trace}
6000 0.059 0.000 0.059 0.000 {mul}
11 0.054 0.005 0.054 0.005 {cPickle.loads}
9000 0.052 0.000 0.052 0.000 {method 'type' of 'torch._C._TensorBase' objects}
--------------------------------------------------------------------------------
autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls Input Shapes
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
select 33.87% 2.194ms 33.87% 2.194ms 2.194ms NaN 0.000us 0.000us 1 []
dot 16.23% 1.051ms 16.23% 1.051ms 1.051ms NaN 0.000us 0.000us 1 []
item 8.58% 555.789us 8.58% 555.789us 555.789us NaN 0.000us 0.000us 1 []
_local_scalar_dense 8.57% 554.887us 8.57% 554.887us 554.887us NaN 0.000us 0.000us 1 []
to 7.25% 469.485us 7.25% 469.485us 469.485us NaN 0.000us 0.000us 1 []
div 6.67% 431.912us 6.67% 431.912us 431.912us NaN 0.000us 0.000us 1 []
to 2.57% 166.272us 2.57% 166.272us 166.272us NaN 0.000us 0.000us 1 []
item 2.44% 158.373us 2.44% 158.373us 158.373us NaN 0.000us 0.000us 1 []
_local_scalar_dense 2.43% 157.503us 2.43% 157.503us 157.503us NaN 0.000us 0.000us 1 []
trace 2.43% 157.433us 2.43% 157.433us 157.433us NaN 0.000us 0.000us 1 []
item 1.94% 125.770us 1.94% 125.770us 125.770us NaN 0.000us 0.000us 1 []
_local_scalar_dense 1.93% 125.012us 1.93% 125.012us 125.012us NaN 0.000us 0.000us 1 []
trace 1.83% 118.830us 1.83% 118.830us 118.830us NaN 0.000us 0.000us 1 []
to 1.73% 111.795us 1.73% 111.795us 111.795us NaN 0.000us 0.000us 1 []
to 1.54% 99.455us 1.54% 99.455us 99.455us NaN 0.000us 0.000us 1 []
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Self CPU time total: 6.478ms
CUDA time total: 0.000us
--------------------------------------------------------------------------------
autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
Because the autograd profiler uses the CUDA event API,
the CUDA time column reports approximately max(cuda_time, cpu_time).
Please ignore this output if your code does not use CUDA.
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls Input Shapes
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
to 12.78% 2.265ms 12.78% 2.265ms 2.265ms 13.28% 2.265ms 2.265ms 1 []
empty 12.68% 2.248ms 12.68% 2.248ms 2.248ms 13.19% 2.249ms 2.249ms 1 []
detach_ 12.33% 2.186ms 12.33% 2.186ms 2.186ms 12.82% 2.187ms 2.187ms 1 []
dot 6.65% 1.179ms 6.65% 1.179ms 1.179ms 2.88% 491.521us 491.521us 1 []
item 5.42% 961.185us 5.42% 961.185us 961.185us 5.64% 961.500us 961.500us 1 []
_local_scalar_dense 5.38% 953.277us 5.38% 953.277us 953.277us 5.61% 956.000us 956.000us 1 []
to 5.34% 946.186us 5.34% 946.186us 946.186us 5.55% 946.500us 946.500us 1 []
empty 5.20% 921.878us 5.20% 921.878us 921.878us 5.42% 924.500us 924.500us 1 []
to 5.04% 894.055us 5.04% 894.055us 894.055us 5.25% 894.750us 894.750us 1 []
to 4.96% 878.520us 4.96% 878.520us 878.520us 5.15% 878.750us 878.750us 1 []
empty 4.93% 873.837us 4.93% 873.837us 873.837us 5.14% 876.750us 876.750us 1 []
item 4.87% 863.701us 4.87% 863.701us 863.701us 5.07% 863.750us 863.750us 1 []
_local_scalar_dense 4.84% 857.032us 4.84% 857.032us 857.032us 5.04% 859.750us 859.750us 1 []
empty 4.83% 856.190us 4.83% 856.190us 856.190us 5.04% 858.750us 858.750us 1 []
to 4.73% 838.839us 4.73% 838.839us 838.839us 4.92% 838.750us 838.750us 1 []
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Self CPU time total: 17.724ms
CUDA time total: 17.052ms