Hello,
I am using `torch.utils.bottleneck` to profile some code that runs on both CPU and CUDA. I have tried to find documentation on how to interpret its output, but have not found anything relevant. Could anybody please help me with the following?

- What are the time units for the cProfile results? I assume it is milliseconds, since I read somewhere that `bottleneck` uses the event API.
- What is the difference between the `Self CPU total` and `CPU total` columns?
- Below is the output of `torch.utils.bottleneck` for my script. It seems the call to `to` takes most of the CPU time. Is this right?
- The `percall` columns in the cProfile output show a value of `0.000`. Does this mean the value is smaller than 1 microsecond (or maybe smaller than 1 ms) and could not be printed, or was there an error and the value could not be displayed?
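For reference, here is the arithmetic I tried for the `to` row, assuming `percall` is simply `tottime / ncalls` (which is how I understand cProfile's documentation):

```python
# cProfile prints percall = tottime / ncalls with three decimal
# places, so a sub-millisecond average shows up as 0.000.
tottime = 2.662   # seconds, from the `to` row in the output below
ncalls = 18009

percall = tottime / ncalls
print(f"{percall:.3f}")                  # -> 0.000 (what cProfile prints)
print(f"{percall * 1e6:.1f} us per call")  # -> 147.8 us per call (actual average)
```

So if that assumption is right, the value is not an error, just rounded away by the three-decimal formatting.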
I would appreciate it if anybody could point me to a document explaining the contents of this output in more detail. Thanks!
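In case it helps, here is a minimal standalone snippet (not my actual script) showing how I understand the autograd-profiler half of `bottleneck` can be reproduced directly; the table it prints labels each entry with explicit units (us/ms/s), which is what made me unsure about the cProfile units:

```python
import torch
from torch.autograd import profiler

# Profile a few CPU tensor ops, roughly mirroring what
# torch.utils.bottleneck records for its autograd section.
x = torch.randn(256, 256)
with profiler.profile() as prof:
    for _ in range(10):
        y = x.mm(x)
        s = y.trace().item()

# Same sort key as bottleneck's "top 15 events sorted by cpu_time_total".
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=15)
print(table)
```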
--------------------------------------------------------------------------------
Environment Summary
--------------------------------------------------------------------------------
PyTorch 1.2.0 compiled w/ CUDA 10.0.130
Running with Python 2.7 and CUDA 10.0.130
`pip list` truncated output:
numpy==1.15.1
torch==0.4.0
torchvision==0.2.1
--------------------------------------------------------------------------------
cProfile output
--------------------------------------------------------------------------------
832891 function calls (825332 primitive calls) in 4.975 seconds
Ordered by: internal time
List reduced from 3502 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
18009 2.662 0.000 2.662 0.000 {method 'to' of 'torch._C._TensorBase' objects}
1000 0.164 0.000 1.858 0.002 geometric_controller_gpu.py:103(position_controller)
1000 0.136 0.000 0.330 0.000 geometric_controller_gpu.py:222(desired_body_rate)
6008 0.126 0.000 0.126 0.000 {tensor}
6000 0.089 0.000 0.089 0.000 {addcmul}
8000 0.089 0.000 0.089 0.000 {cross}
3000 0.088 0.000 0.088 0.000 {dot}
7000 0.068 0.000 0.068 0.000 {add}
2000 0.066 0.000 0.186 0.000 /home/fer/git_clone/riseq_ws/src/riseq_common/src/riseq_common/torch_dyn_utils.py:117(rotation_matrix_distance)
3000 0.065 0.000 0.065 0.000 {cat}
5000 0.065 0.000 0.065 0.000 {frobenius_norm}
2000 0.061 0.000 0.061 0.000 {trace}
6000 0.059 0.000 0.059 0.000 {mul}
11 0.054 0.005 0.054 0.005 {cPickle.loads}
9000 0.052 0.000 0.052 0.000 {method 'type' of 'torch._C._TensorBase' objects}
--------------------------------------------------------------------------------
autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls Input Shapes
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
select 33.87% 2.194ms 33.87% 2.194ms 2.194ms NaN 0.000us 0.000us 1 []
dot 16.23% 1.051ms 16.23% 1.051ms 1.051ms NaN 0.000us 0.000us 1 []
item 8.58% 555.789us 8.58% 555.789us 555.789us NaN 0.000us 0.000us 1 []
_local_scalar_dense 8.57% 554.887us 8.57% 554.887us 554.887us NaN 0.000us 0.000us 1 []
to 7.25% 469.485us 7.25% 469.485us 469.485us NaN 0.000us 0.000us 1 []
div 6.67% 431.912us 6.67% 431.912us 431.912us NaN 0.000us 0.000us 1 []
to 2.57% 166.272us 2.57% 166.272us 166.272us NaN 0.000us 0.000us 1 []
item 2.44% 158.373us 2.44% 158.373us 158.373us NaN 0.000us 0.000us 1 []
_local_scalar_dense 2.43% 157.503us 2.43% 157.503us 157.503us NaN 0.000us 0.000us 1 []
trace 2.43% 157.433us 2.43% 157.433us 157.433us NaN 0.000us 0.000us 1 []
item 1.94% 125.770us 1.94% 125.770us 125.770us NaN 0.000us 0.000us 1 []
_local_scalar_dense 1.93% 125.012us 1.93% 125.012us 125.012us NaN 0.000us 0.000us 1 []
trace 1.83% 118.830us 1.83% 118.830us 118.830us NaN 0.000us 0.000us 1 []
to 1.73% 111.795us 1.73% 111.795us 111.795us NaN 0.000us 0.000us 1 []
to 1.54% 99.455us 1.54% 99.455us 99.455us NaN 0.000us 0.000us 1 []
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Self CPU time total: 6.478ms
CUDA time total: 0.000us
--------------------------------------------------------------------------------
autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
Because the autograd profiler uses the CUDA event API,
the CUDA time column reports approximately max(cuda_time, cpu_time).
Please ignore this output if your code does not use CUDA.
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls Input Shapes
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
to 12.78% 2.265ms 12.78% 2.265ms 2.265ms 13.28% 2.265ms 2.265ms 1 []
empty 12.68% 2.248ms 12.68% 2.248ms 2.248ms 13.19% 2.249ms 2.249ms 1 []
detach_ 12.33% 2.186ms 12.33% 2.186ms 2.186ms 12.82% 2.187ms 2.187ms 1 []
dot 6.65% 1.179ms 6.65% 1.179ms 1.179ms 2.88% 491.521us 491.521us 1 []
item 5.42% 961.185us 5.42% 961.185us 961.185us 5.64% 961.500us 961.500us 1 []
_local_scalar_dense 5.38% 953.277us 5.38% 953.277us 953.277us 5.61% 956.000us 956.000us 1 []
to 5.34% 946.186us 5.34% 946.186us 946.186us 5.55% 946.500us 946.500us 1 []
empty 5.20% 921.878us 5.20% 921.878us 921.878us 5.42% 924.500us 924.500us 1 []
to 5.04% 894.055us 5.04% 894.055us 894.055us 5.25% 894.750us 894.750us 1 []
to 4.96% 878.520us 4.96% 878.520us 878.520us 5.15% 878.750us 878.750us 1 []
empty 4.93% 873.837us 4.93% 873.837us 873.837us 5.14% 876.750us 876.750us 1 []
item 4.87% 863.701us 4.87% 863.701us 863.701us 5.07% 863.750us 863.750us 1 []
_local_scalar_dense 4.84% 857.032us 4.84% 857.032us 857.032us 5.04% 859.750us 859.750us 1 []
empty 4.83% 856.190us 4.83% 856.190us 856.190us 5.04% 858.750us 858.750us 1 []
to 4.73% 838.839us 4.73% 838.839us 838.839us 4.92% 838.750us 838.750us 1 []
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Self CPU time total: 17.724ms
CUDA time total: 17.052ms