I'm looking for general tips and pointers on how to read the following bottleneck report.
Most specifically, there's a call to the pickler (`Pickler.dump`) that occurs many times, and I'm not sure where in my code it is called.
So my main question is: how can I track down the calls to the pickler method?
Update

I now think the pickler calls are related to using num_workers > 0 in the DataLoader on Windows: worker processes are spawned rather than forked there, so the dataset gets pickled for every worker. Is there anything I can still do to avoid the delay?
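For reference, the mitigation I'm considering (a sketch, under the assumption that the pickling happens at worker startup): keeping workers alive across epochs with `persistent_workers=True`, added in PyTorch 1.7, together with the `if __name__ == "__main__"` guard that Windows spawning requires. The toy `TensorDataset` stands in for my real dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    # Toy dataset standing in for the real 3D-volume dataset.
    dataset = TensorDataset(torch.randn(64, 3, 8, 8))

    # persistent_workers=True (PyTorch 1.7+) keeps the worker
    # processes alive between epochs, so the dataset is pickled and
    # the workers are spawned only once instead of once per epoch.
    loader = DataLoader(
        dataset,
        batch_size=8,
        num_workers=2,
        persistent_workers=True,
    )

    total = 0
    for _ in range(2):  # two epochs reuse the same workers
        for (batch,) in loader:
            total += batch.shape[0]
    return total

if __name__ == "__main__":
    # Required on Windows: workers are started with "spawn", which
    # re-imports this module in each child process.
    main()
```

If that doesn't help, the remaining options I know of are `num_workers=0` (no pickling, but no parallel loading) or making the dataset object cheaper to pickle.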
The report:
--------------------------------------------------------------------------------
Environment Summary
--------------------------------------------------------------------------------
PyTorch 1.7.0 DEBUG compiled w/ CUDA 10.2
Running with Python 3.6 and CUDA 10.2.89
`pip3 list` truncated output:
numpy==1.19.2
torch==1.7.0
torchaudio==0.7.0
torchfile==0.1.0
torchnet==0.0.4
torchsummary==1.5.1
torchsummaryX==1.3.0
torchvision==0.8.1
--------------------------------------------------------------------------------
cProfile output
--------------------------------------------------------------------------------
14020759 function calls (13677326 primitive calls) in 383.299 seconds
Ordered by: internal time
List reduced from 7342 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
1280 164.908 0.129 164.908 0.129 {method 'run_backward' of 'torch._C._EngineBase' objects}
174 147.153 0.846 147.171 0.846 {method 'dump' of '_pickle.Pickler' objects}
2856 12.439 0.004 12.439 0.004 {method 'addConstraint' of 'kiwisolver.Solver' objects}
6227 11.088 0.002 11.088 0.002 {method 'acquire' of '_thread.lock' objects}
48060 6.492 0.000 6.492 0.000 {built-in method conv3d}
40050 3.384 0.000 3.384 0.000 {built-in method instance_norm}
3229 3.126 0.001 3.126 0.001 {method 'cuda' of 'torch._C._TensorBase' objects}
144 2.704 0.019 2.704 0.019 {built-in method _winapi.WaitForSingleObject}
181024/3202 1.350 0.000 17.682 0.006 D:\ProgramData\Anaconda3\envs\env_zoo\lib\site-packages\torch\nn\modules\module.py:715(_call_impl)
44856 1.268 0.000 1.268 0.000 {built-in method torch._C._nn.leaky_relu}
1280 1.156 0.001 5.191 0.004 D:\ProgramData\Anaconda3\envs\env_zoo\lib\site-packages\torch\optim\functional.py:53(adam)
232908 1.142 0.000 2.557 0.000 D:\ProgramData\Anaconda3\envs\env_zoo\lib\site-packages\torch\tensor.py:933(grad)
66560 1.021 0.000 1.021 0.000 {method 'mul_' of 'torch._C._TensorBase' objects}
33254 0.968 0.000 0.968 0.000 {method 'zero_' of 'torch._C._TensorBase' objects}
1602 0.882 0.001 17.403 0.011 D:\users\Gony\FMRI_inpainting1\lib\medzoo\Unet3D_gony.py:121(forward)
--------------------------------------------------------------------------------
autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
------------------------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::addcmul_ 11.60% 1.068s 11.60% 1.068s 1.068s 0.000us NaN 0.000us 0.000us 1
aten::instance_norm 7.91% 728.799ms 7.91% 728.799ms 728.799ms 0.000us NaN 0.000us 0.000us 1
aten::batch_norm 7.91% 728.760ms 7.91% 728.760ms 728.760ms 0.000us NaN 0.000us 0.000us 1
aten::_batch_norm_impl_index 7.91% 728.753ms 7.91% 728.753ms 728.753ms 0.000us NaN 0.000us 0.000us 1
aten::native_batch_norm 7.91% 728.738ms 7.91% 728.738ms 728.738ms 0.000us NaN 0.000us 0.000us 1
ViewBackward 7.57% 697.112ms 7.57% 697.112ms 697.112ms 0.000us NaN 0.000us 0.000us 1
aten::reshape 7.57% 697.101ms 7.57% 697.101ms 697.101ms 0.000us NaN 0.000us 0.000us 1
aten::view 7.57% 697.098ms 7.57% 697.098ms 697.098ms 0.000us NaN 0.000us 0.000us 1
aten::div 5.13% 472.212ms 5.13% 472.212ms 472.212ms 0.000us NaN 0.000us 0.000us 1
ViewBackward 4.87% 449.067ms 4.87% 449.067ms 449.067ms 0.000us NaN 0.000us 0.000us 1
aten::reshape 4.87% 449.064ms 4.87% 449.064ms 449.064ms 0.000us NaN 0.000us 0.000us 1
aten::view 4.87% 449.061ms 4.87% 449.061ms 449.061ms 0.000us NaN 0.000us 0.000us 1
struct torch::autograd::AccumulateGrad 4.78% 440.524ms 4.78% 440.524ms 440.524ms 0.000us NaN 0.000us 0.000us 1
aten::add_ 4.77% 439.220ms 4.77% 439.220ms 439.220ms 0.000us NaN 0.000us 0.000us 1
struct torch::autograd::AccumulateGrad 4.76% 438.775ms 4.76% 438.775ms 438.775ms 0.000us NaN 0.000us 0.000us 1
------------------------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 9.213s
CUDA time total: 0.000us
--------------------------------------------------------------------------------
autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
Because the autograd profiler uses the CUDA event API,
the CUDA time column reports approximately max(cuda_time, cpu_time).
Please ignore this output if your code does not use CUDA.
------------------------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::addcdiv_ 13.27% 1.224s 13.27% 1.224s 1.224s 1.224s 16.98% 1.224s 1.224s 1
ViewBackward 8.58% 791.255ms 8.58% 791.255ms 791.255ms 786.256ms 10.90% 786.256ms 786.256ms 1
aten::reshape 8.57% 791.223ms 8.57% 791.223ms 791.223ms 786.240ms 10.90% 786.240ms 786.240ms 1
aten::view 8.57% 791.197ms 8.57% 791.197ms 791.197ms 0.000us 0.00% 0.000us 0.000us 1
aten::instance_norm 8.04% 742.228ms 8.04% 742.228ms 742.228ms 742.256ms 10.29% 742.256ms 742.256ms 1
aten::view 8.04% 741.937ms 8.04% 741.937ms 741.937ms 0.000us 0.00% 0.000us 0.000us 1
aten::add_ 5.36% 494.861ms 5.36% 494.861ms 494.861ms 494.864ms 6.86% 494.864ms 494.864ms 1
ViewBackward 5.08% 468.636ms 5.08% 468.636ms 468.636ms 466.000ms 6.46% 466.000ms 466.000ms 1
aten::reshape 5.08% 468.615ms 5.08% 468.615ms 468.615ms 466.000ms 6.46% 466.000ms 466.000ms 1
aten::view 5.08% 468.590ms 5.08% 468.590ms 468.590ms 0.000us 0.00% 0.000us 0.000us 1
struct torch::autograd::AccumulateGrad 4.97% 459.025ms 4.97% 459.025ms 459.025ms 459.040ms 6.37% 459.040ms 459.040ms 1
aten::add_ 4.96% 457.739ms 4.96% 457.739ms 457.739ms 457.760ms 6.35% 457.760ms 457.760ms 1
struct torch::autograd::AccumulateGrad 4.85% 447.087ms 4.85% 447.087ms 447.087ms 447.104ms 6.20% 447.104ms 447.104ms 1
aten::add_ 4.83% 445.955ms 4.83% 445.955ms 445.955ms 445.968ms 6.19% 445.968ms 445.968ms 1
struct torch::autograd::AccumulateGrad 4.71% 434.607ms 4.71% 434.607ms 434.607ms 434.624ms 6.03% 434.624ms 434.624ms 1
------------------------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 9.227s
CUDA time total: 7.210s