Hi all,
I’m training my network and only seeing around 10% GPU utilization and 25% CPU utilization, so I ran torch.utils.bottleneck and got the result below. I’m not sure how to interpret it, and the documentation doesn’t help much.
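For reference, I invoked it with the standard command (main.py is my training script):

python -m torch.utils.bottleneck .\main.py

The summary at the end of the run reads: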
Self CPU time total: 11.741s
CUDA time total: 2.928ms
From this I’m assuming that my model is being run on the CPU rather than the GPU? That would be strange, since I move the model and the data to the GPU via .to(device), as in the sketch below.
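This is a minimal sketch of how I place things on the GPU and sanity-check the placement (stand-in model and data, not my actual network):

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(8, 2).to(device)    # stand-in for my network
batch = torch.randn(4, 8).to(device)  # stand-in for one input batch

# Both should print cuda:0 if the model and data actually live on the GPU
print(next(model.parameters()).device)
print(batch.device)

The equivalent checks on my real model report cuda:0, which is why the large CPU time confuses me.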
In particular, nearly all of the cProfile time below is attributed to a single entry:

method 'item' of 'torch._C._TensorBase' objects
How do I interpret this entry (and the others) so that I can see where the bottleneck lies and improve the training/model?
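My tentative reading is that .item() copies a scalar from GPU memory back to the CPU, which forces a synchronization, so the time charged to it may also include waiting for previously queued GPU kernels to finish. The only place I call .item() is loss logging, roughly like this simplified sketch (stand-in model, data, and loop, not my actual code):

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 2).to(device)    # stand-in model
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())

running_loss = 0.0
for step in range(100):
    x = torch.randn(4, 8, device=device)  # stand-in batch
    y = torch.randn(4, 2, device=device)
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # .item() transfers the scalar to the CPU and blocks until the GPU catches up
    running_loss += loss.item()

If that reading is correct, I could accumulate loss.detach() on the GPU and call .item() only once per epoch, but I’d first like to confirm that this synchronization is really what the profiler is showing.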
I would appreciate it if anybody could point me to documentation explaining the contents of this output in more detail. Thanks!
Environment Summary
PyTorch 1.7.0 DEBUG compiled w/ CUDA 11.0
Running with Python 3.8 and
`pip3 list` truncated output:
numpy==1.18.5
numpydoc==1.1.0
torch==1.7.0
torch-summary==1.4.3
torchaudio==0.7.0
torchvision==0.8.1
cProfile output
2103854 function calls (2070741 primitive calls) in 289.909 seconds
Ordered by: internal time
List reduced from 4738 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
2284 251.530 0.110 251.530 0.110 {method 'item' of 'torch._C._TensorBase' objects}
1322 13.616 0.010 13.616 0.010 {method 'acquire' of '_thread.lock' objects}
1098 8.855 0.008 8.855 0.008 {method 'to' of 'torch._C._TensorBase' objects}
393 5.636 0.014 5.636 0.014 {method 'run_backward' of 'torch._C._EngineBase' objects}
9477 4.579 0.000 4.579 0.000 {built-in method conv2d}
12 0.646 0.054 0.646 0.054 {built-in method _winapi.WaitForSingleObject}
798 0.576 0.001 0.576 0.001 {method 'matmul' of 'torch._C._TensorBase' objects}
399 0.468 0.001 0.468 0.001 {built-in method lstm}
16 0.418 0.026 0.421 0.026 {method 'dump' of '_pickle.Pickler' objects}
1 0.313 0.313 0.313 0.313 {built-in method _cudnn_rnn_flatten_weight}
30 0.294 0.010 0.294 0.010 {method 'uniform_' of 'torch._C._TensorBase' objects}
82590 0.210 0.000 0.329 0.000 C:\ProgramData\Anaconda3\lib\site-packages\torch\tensor.py:933(grad)
786 0.184 0.000 0.773 0.001 C:\ProgramData\Anaconda3\lib\site-packages\torch\optim\functional.py:53(adam)
23580 0.179 0.000 0.179 0.000 {method 'mul_' of 'torch._C._TensorBase' objects}
1 0.154 0.154 289.910 289.910 .\main.py:2(<module>)
autograd profiler output (CPU mode)
top 15 events sorted by cpu_time_total
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
aten::item 7.05% 767.910ms 7.05% 767.910ms 767.910ms 0.000us NaN 0.000us 0.000us 1
aten::_local_scalar_dense 7.05% 767.906ms 7.05% 767.906ms 767.906ms 0.000us NaN 0.000us 0.000us 1
aten::item 7.01% 762.924ms 7.01% 762.924ms 762.924ms 0.000us NaN 0.000us 0.000us 1
aten::_local_scalar_dense 7.01% 762.920ms 7.01% 762.920ms 762.920ms 0.000us NaN 0.000us 0.000us 1
aten::item 6.64% 723.086ms 6.64% 723.086ms 723.086ms 0.000us NaN 0.000us 0.000us 1
aten::_local_scalar_dense 6.64% 723.082ms 6.64% 723.082ms 723.082ms 0.000us NaN 0.000us 0.000us 1
aten::item 6.58% 716.097ms 6.58% 716.097ms 716.097ms 0.000us NaN 0.000us 0.000us 1
aten::_local_scalar_dense 6.58% 716.091ms 6.58% 716.091ms 716.091ms 0.000us NaN 0.000us 0.000us 1
aten::item 6.55% 713.050ms 6.55% 713.050ms 713.050ms 0.000us NaN 0.000us 0.000us 1
aten::_local_scalar_dense 6.55% 713.046ms 6.55% 713.046ms 713.046ms 0.000us NaN 0.000us 0.000us 1
aten::item 6.51% 709.503ms 6.51% 709.503ms 709.503ms 0.000us NaN 0.000us 0.000us 1
aten::_local_scalar_dense 6.51% 709.499ms 6.51% 709.499ms 709.499ms 0.000us NaN 0.000us 0.000us 1
aten::item 6.45% 702.503ms 6.45% 702.503ms 702.503ms 0.000us NaN 0.000us 0.000us 1
aten::_local_scalar_dense 6.45% 702.499ms 6.45% 702.499ms 702.499ms 0.000us NaN 0.000us 0.000us 1
aten::item 6.43% 700.667ms 6.43% 700.667ms 700.667ms 0.000us NaN 0.000us 0.000us 1
Self CPU time total: 10.891s
CUDA time total: 0.000us
autograd profiler output (CUDA mode)
top 15 events sorted by cpu_time_total
Because the autograd profiler uses the CUDA event API,
the CUDA time column reports approximately max(cuda_time, cpu_time).
Please ignore this output if your code does not use CUDA.
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
aten::item 6.89% 808.622ms 6.89% 808.622ms 808.622ms 208.000us 7.10% 208.000us 208.000us 1
aten::_local_scalar_dense 6.89% 808.615ms 6.89% 808.615ms 808.615ms 192.000us 6.56% 192.000us 192.000us 1
aten::item 6.84% 803.543ms 6.84% 803.543ms 803.543ms 176.000us 6.01% 176.000us 176.000us 1
aten::_local_scalar_dense 6.84% 803.536ms 6.84% 803.536ms 803.536ms 176.000us 6.01% 176.000us 176.000us 1
aten::item 6.83% 801.321ms 6.83% 801.321ms 801.321ms 192.000us 6.56% 192.000us 192.000us 1
aten::_local_scalar_dense 6.83% 801.314ms 6.83% 801.314ms 801.314ms 176.000us 6.01% 176.000us 176.000us 1
aten::item 6.75% 792.760ms 6.75% 792.760ms 792.760ms 208.000us 7.10% 208.000us 208.000us 1
aten::_local_scalar_dense 6.75% 792.753ms 6.75% 792.753ms 792.753ms 208.000us 7.10% 208.000us 208.000us 1
aten::item 6.55% 769.503ms 6.55% 769.503ms 769.503ms 176.000us 6.01% 176.000us 176.000us 1
aten::_local_scalar_dense 6.55% 769.497ms 6.55% 769.497ms 769.497ms 176.000us 6.01% 176.000us 176.000us 1
aten::item 6.52% 765.528ms 6.52% 765.528ms 765.528ms 208.000us 7.10% 208.000us 208.000us 1
aten::_local_scalar_dense 6.52% 765.521ms 6.52% 765.521ms 765.521ms 192.000us 6.56% 192.000us 192.000us 1
aten::item 6.42% 753.558ms 6.42% 753.558ms 753.558ms 208.000us 7.10% 208.000us 208.000us 1
aten::_local_scalar_dense 6.42% 753.551ms 6.42% 753.551ms 753.551ms 192.000us 6.56% 192.000us 192.000us 1
aten::item 6.40% 751.019ms 6.40% 751.019ms 751.019ms 240.000us 8.20% 240.000us 240.000us 1
Self CPU time total: 11.741s
CUDA time total: 2.928ms