How to interpret torch.utils.bottleneck summary

Hi all,

I’m training my network and I’m only seeing around 10% GPU utilization and 25% CPU utilization, so I ran torch.utils.bottleneck and got the result below, but I’m not sure how to interpret it, and the documentation doesn’t help much.

Self CPU time total: 11.741s
CUDA time total: 2.928ms
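
For reference, I invoked bottleneck on my training script roughly like this (main.py is the entry point that also shows up in the cProfile output below):

    python -m torch.utils.bottleneck .\main.py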

From this I’m assuming that my model is being run on the CPU rather than the GPU? That is strange, since I am moving the model and the data onto the GPU via .to(device).
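
In case it’s relevant, my device handling follows this pattern (a simplified stand-in with a tiny model and random data, not my actual code):

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    # stand-in setup just to show how I move the model and each batch to the GPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Linear(10, 2).to(device)          # model parameters now live on the GPU
    loader = DataLoader(TensorDataset(torch.randn(64, 10),
                                      torch.randint(0, 2, (64,))), batch_size=16)

    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)   # batch moved to the GPU
        outputs = model(inputs)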

Most of the time in the cProfile output seems to be spent in this entry:

method 'item' of 'torch._C._TensorBase' objects

How do I interpret this entry (among others) so that I can see where the bottleneck lies and improve the model/training?
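
If it matters, I believe the .item() calls come from my loss logging; a minimal made-up loop (not my actual model) that shows the pattern:

    import torch
    import torch.nn as nn

    # tiny made-up example just to show where .item() appears in my training loop
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Linear(10, 2).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    running_loss = 0.0
    for step in range(100):
        inputs = torch.randn(32, 10, device=device)
        targets = torch.randint(0, 2, (32,), device=device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()   # .item() copies the scalar loss back to the CPU as a Python float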

I would appreciate it if anybody could point me to a document explaining the contents of this output in more detail. Thanks!


Environment Summary

PyTorch 1.7.0 DEBUG compiled w/ CUDA 11.0
Running with Python 3.8 and

pip3 list truncated output:
numpy==1.18.5
numpydoc==1.1.0
torch==1.7.0
torch-summary==1.4.3
torchaudio==0.7.0
torchvision==0.8.1

cProfile output

     2103854 function calls (2070741 primitive calls) in 289.909 seconds

Ordered by: internal time
List reduced from 4738 to 15 due to restriction <15>


ncalls tottime percall cumtime percall filename:lineno(function)
 2284  251.530    0.110  251.530    0.110 {method 'item' of 'torch._C._TensorBase' objects}
 1322   13.616    0.010   13.616    0.010 {method 'acquire' of '_thread.lock' objects}
 1098    8.855    0.008    8.855    0.008 {method 'to' of 'torch._C._TensorBase' objects}
  393    5.636    0.014    5.636    0.014 {method 'run_backward' of 'torch._C._EngineBase' objects}
 9477    4.579    0.000    4.579    0.000 {built-in method conv2d}
   12    0.646    0.054    0.646    0.054 {built-in method _winapi.WaitForSingleObject}
  798    0.576    0.001    0.576    0.001 {method 'matmul' of 'torch._C._TensorBase' objects}
  399    0.468    0.001    0.468    0.001 {built-in method lstm}
   16    0.418    0.026    0.421    0.026 {method 'dump' of '_pickle.Pickler' objects}
    1    0.313    0.313    0.313    0.313 {built-in method _cudnn_rnn_flatten_weight}
   30    0.294    0.010    0.294    0.010 {method 'uniform_' of 'torch._C._TensorBase' objects}
82590    0.210    0.000    0.329    0.000 C:\ProgramData\Anaconda3\lib\site-packages\torch\tensor.py:933(grad)
  786    0.184    0.000    0.773    0.001 C:\ProgramData\Anaconda3\lib\site-packages\torch\optim\functional.py:53(adam)
23580    0.179    0.000    0.179    0.000 {method 'mul_' of 'torch._C._TensorBase' objects}
    1    0.154    0.154  289.910  289.910 .\main.py:2(<module>)


autograd profiler output (CPU mode)

    top 15 events sorted by cpu_time_total

                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls

               aten::item         7.05%     767.910ms         7.05%     767.910ms     767.910ms       0.000us           NaN       0.000us       0.000us             1
aten::_local_scalar_dense         7.05%     767.906ms         7.05%     767.906ms     767.906ms       0.000us           NaN       0.000us       0.000us             1
               aten::item         7.01%     762.924ms         7.01%     762.924ms     762.924ms       0.000us           NaN       0.000us       0.000us             1
aten::_local_scalar_dense         7.01%     762.920ms         7.01%     762.920ms     762.920ms       0.000us           NaN       0.000us       0.000us             1
               aten::item         6.64%     723.086ms         6.64%     723.086ms     723.086ms       0.000us           NaN       0.000us       0.000us             1
aten::_local_scalar_dense         6.64%     723.082ms         6.64%     723.082ms     723.082ms       0.000us           NaN       0.000us       0.000us             1
               aten::item         6.58%     716.097ms         6.58%     716.097ms     716.097ms       0.000us           NaN       0.000us       0.000us             1
aten::_local_scalar_dense         6.58%     716.091ms         6.58%     716.091ms     716.091ms       0.000us           NaN       0.000us       0.000us             1
               aten::item         6.55%     713.050ms         6.55%     713.050ms     713.050ms       0.000us           NaN       0.000us       0.000us             1
aten::_local_scalar_dense         6.55%     713.046ms         6.55%     713.046ms     713.046ms       0.000us           NaN       0.000us       0.000us             1
               aten::item         6.51%     709.503ms         6.51%     709.503ms     709.503ms       0.000us           NaN       0.000us       0.000us             1
aten::_local_scalar_dense         6.51%     709.499ms         6.51%     709.499ms     709.499ms       0.000us           NaN       0.000us       0.000us             1
               aten::item         6.45%     702.503ms         6.45%     702.503ms     702.503ms       0.000us           NaN       0.000us       0.000us             1
aten::_local_scalar_dense         6.45%     702.499ms         6.45%     702.499ms     702.499ms       0.000us           NaN       0.000us       0.000us             1
               aten::item         6.43%     700.667ms         6.43%     700.667ms     700.667ms       0.000us           NaN       0.000us       0.000us             1

Self CPU time total: 10.891s
CUDA time total: 0.000us


autograd profiler output (CUDA mode)

    top 15 events sorted by cpu_time_total

    Because the autograd profiler uses the CUDA event API,
    the CUDA time column reports approximately max(cuda_time, cpu_time).
    Please ignore this output if your code does not use CUDA.

                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls

               aten::item         6.89%     808.622ms         6.89%     808.622ms     808.622ms     208.000us         7.10%     208.000us     208.000us             1
aten::_local_scalar_dense         6.89%     808.615ms         6.89%     808.615ms     808.615ms     192.000us         6.56%     192.000us     192.000us             1
               aten::item         6.84%     803.543ms         6.84%     803.543ms     803.543ms     176.000us         6.01%     176.000us     176.000us             1
aten::_local_scalar_dense         6.84%     803.536ms         6.84%     803.536ms     803.536ms     176.000us         6.01%     176.000us     176.000us             1
               aten::item         6.83%     801.321ms         6.83%     801.321ms     801.321ms     192.000us         6.56%     192.000us     192.000us             1
aten::_local_scalar_dense         6.83%     801.314ms         6.83%     801.314ms     801.314ms     176.000us         6.01%     176.000us     176.000us             1
               aten::item         6.75%     792.760ms         6.75%     792.760ms     792.760ms     208.000us         7.10%     208.000us     208.000us             1
aten::_local_scalar_dense         6.75%     792.753ms         6.75%     792.753ms     792.753ms     208.000us         7.10%     208.000us     208.000us             1
               aten::item         6.55%     769.503ms         6.55%     769.503ms     769.503ms     176.000us         6.01%     176.000us     176.000us             1
aten::_local_scalar_dense         6.55%     769.497ms         6.55%     769.497ms     769.497ms     176.000us         6.01%     176.000us     176.000us             1
               aten::item         6.52%     765.528ms         6.52%     765.528ms     765.528ms     208.000us         7.10%     208.000us     208.000us             1
aten::_local_scalar_dense         6.52%     765.521ms         6.52%     765.521ms     765.521ms     192.000us         6.56%     192.000us     192.000us             1
               aten::item         6.42%     753.558ms         6.42%     753.558ms     753.558ms     208.000us         7.10%     208.000us     208.000us             1
aten::_local_scalar_dense         6.42%     753.551ms         6.42%     753.551ms     753.551ms     192.000us         6.56%     192.000us     192.000us             1
               aten::item         6.40%     751.019ms         6.40%     751.019ms     751.019ms     240.000us         8.20%     240.000us     240.000us             1

Self CPU time total: 11.741s
CUDA time total: 2.928ms
