Understanding the torch.utils.bottleneck report

I’m looking for general tips and pointers on how to read the following bottleneck report.

More specifically, there’s a call to the pickler that occurs many times, and I’m not sure where in my code it is invoked.
So my main question is: how can I track down the calls to the `Pickler.dump` method?
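One way to find the call site is to run the training script under `cProfile` yourself and then ask `pstats` for the *callers* of `dump`. A minimal, self-contained sketch (here `train_step` is a made-up stand-in for your own loop; in practice you would profile your real entry point):

```python
import cProfile
import io
import pickle
import pstats

def train_step(batch):
    # Stand-in for the real training-loop code; it triggers the same
    # {method 'dump' of '_pickle.Pickler' objects} entry as the report.
    buf = io.BytesIO()
    pickle.Pickler(buf).dump(batch)

profiler = cProfile.Profile()
profiler.enable()
for _ in range(5):
    train_step({"volume": list(range(10000))})
profiler.disable()

# print_callers("dump") lists every profiled function matching "dump"
# together with the functions that called it (file:line included),
# which points straight at the call site in your code.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.print_callers("dump")
report = stream.getvalue()
print(report)
```

The same trick works on a saved profile (`python -m cProfile -o out.prof train.py`, then `pstats.Stats("out.prof").print_callers("dump")`), which avoids touching the training script at all.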

Update

I think the calls to the pickler are related to `num_workers` and running on Windows. Is there still anything I can do to avoid the delay?

The report:

--------------------------------------------------------------------------------
  Environment Summary
--------------------------------------------------------------------------------
PyTorch 1.7.0 DEBUG compiled w/ CUDA 10.2
Running with Python 3.6 and CUDA 10.2.89

`pip3 list` truncated output:
numpy==1.19.2
torch==1.7.0
torchaudio==0.7.0
torchfile==0.1.0
torchnet==0.0.4
torchsummary==1.5.1
torchsummaryX==1.3.0
torchvision==0.8.1
--------------------------------------------------------------------------------
  cProfile output
--------------------------------------------------------------------------------
         14020759 function calls (13677326 primitive calls) in 383.299 seconds

   Ordered by: internal time
   List reduced from 7342 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1280  164.908    0.129  164.908    0.129 {method 'run_backward' of 'torch._C._EngineBase' objects}
      174  147.153    0.846  147.171    0.846 {method 'dump' of '_pickle.Pickler' objects}
     2856   12.439    0.004   12.439    0.004 {method 'addConstraint' of 'kiwisolver.Solver' objects}
     6227   11.088    0.002   11.088    0.002 {method 'acquire' of '_thread.lock' objects}
    48060    6.492    0.000    6.492    0.000 {built-in method conv3d}
    40050    3.384    0.000    3.384    0.000 {built-in method instance_norm}
     3229    3.126    0.001    3.126    0.001 {method 'cuda' of 'torch._C._TensorBase' objects}
      144    2.704    0.019    2.704    0.019 {built-in method _winapi.WaitForSingleObject}
181024/3202    1.350    0.000   17.682    0.006 D:\ProgramData\Anaconda3\envs\env_zoo\lib\site-packages\torch\nn\modules\module.py:715(_call_impl)
    44856    1.268    0.000    1.268    0.000 {built-in method torch._C._nn.leaky_relu}
     1280    1.156    0.001    5.191    0.004 D:\ProgramData\Anaconda3\envs\env_zoo\lib\site-packages\torch\optim\functional.py:53(adam)
   232908    1.142    0.000    2.557    0.000 D:\ProgramData\Anaconda3\envs\env_zoo\lib\site-packages\torch\tensor.py:933(grad)
    66560    1.021    0.000    1.021    0.000 {method 'mul_' of 'torch._C._TensorBase' objects}
    33254    0.968    0.000    0.968    0.000 {method 'zero_' of 'torch._C._TensorBase' objects}
     1602    0.882    0.001   17.403    0.011 D:\users\Gony\FMRI_inpainting1\lib\medzoo\Unet3D_gony.py:121(forward)


--------------------------------------------------------------------------------
  autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                      Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                            aten::addcmul_        11.60%        1.068s        11.60%        1.068s        1.068s       0.000us           NaN       0.000us       0.000us             1
                       aten::instance_norm         7.91%     728.799ms         7.91%     728.799ms     728.799ms       0.000us           NaN       0.000us       0.000us             1
                          aten::batch_norm         7.91%     728.760ms         7.91%     728.760ms     728.760ms       0.000us           NaN       0.000us       0.000us             1
              aten::_batch_norm_impl_index         7.91%     728.753ms         7.91%     728.753ms     728.753ms       0.000us           NaN       0.000us       0.000us             1
                   aten::native_batch_norm         7.91%     728.738ms         7.91%     728.738ms     728.738ms       0.000us           NaN       0.000us       0.000us             1
                              ViewBackward         7.57%     697.112ms         7.57%     697.112ms     697.112ms       0.000us           NaN       0.000us       0.000us             1
                             aten::reshape         7.57%     697.101ms         7.57%     697.101ms     697.101ms       0.000us           NaN       0.000us       0.000us             1
                                aten::view         7.57%     697.098ms         7.57%     697.098ms     697.098ms       0.000us           NaN       0.000us       0.000us             1
                                 aten::div         5.13%     472.212ms         5.13%     472.212ms     472.212ms       0.000us           NaN       0.000us       0.000us             1
                              ViewBackward         4.87%     449.067ms         4.87%     449.067ms     449.067ms       0.000us           NaN       0.000us       0.000us             1
                             aten::reshape         4.87%     449.064ms         4.87%     449.064ms     449.064ms       0.000us           NaN       0.000us       0.000us             1
                                aten::view         4.87%     449.061ms         4.87%     449.061ms     449.061ms       0.000us           NaN       0.000us       0.000us             1
    struct torch::autograd::AccumulateGrad         4.78%     440.524ms         4.78%     440.524ms     440.524ms       0.000us           NaN       0.000us       0.000us             1
                                aten::add_         4.77%     439.220ms         4.77%     439.220ms     439.220ms       0.000us           NaN       0.000us       0.000us             1
    struct torch::autograd::AccumulateGrad         4.76%     438.775ms         4.76%     438.775ms     438.775ms       0.000us           NaN       0.000us       0.000us             1
------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 9.213s
CUDA time total: 0.000us

--------------------------------------------------------------------------------
  autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

        Because the autograd profiler uses the CUDA event API,
        the CUDA time column reports approximately max(cuda_time, cpu_time).
        Please ignore this output if your code does not use CUDA.

------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                      Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                            aten::addcdiv_        13.27%        1.224s        13.27%        1.224s        1.224s        1.224s        16.98%        1.224s        1.224s             1
                              ViewBackward         8.58%     791.255ms         8.58%     791.255ms     791.255ms     786.256ms        10.90%     786.256ms     786.256ms             1
                             aten::reshape         8.57%     791.223ms         8.57%     791.223ms     791.223ms     786.240ms        10.90%     786.240ms     786.240ms             1
                                aten::view         8.57%     791.197ms         8.57%     791.197ms     791.197ms       0.000us         0.00%       0.000us       0.000us             1
                       aten::instance_norm         8.04%     742.228ms         8.04%     742.228ms     742.228ms     742.256ms        10.29%     742.256ms     742.256ms             1
                                aten::view         8.04%     741.937ms         8.04%     741.937ms     741.937ms       0.000us         0.00%       0.000us       0.000us             1
                                aten::add_         5.36%     494.861ms         5.36%     494.861ms     494.861ms     494.864ms         6.86%     494.864ms     494.864ms             1
                              ViewBackward         5.08%     468.636ms         5.08%     468.636ms     468.636ms     466.000ms         6.46%     466.000ms     466.000ms             1
                             aten::reshape         5.08%     468.615ms         5.08%     468.615ms     468.615ms     466.000ms         6.46%     466.000ms     466.000ms             1
                                aten::view         5.08%     468.590ms         5.08%     468.590ms     468.590ms       0.000us         0.00%       0.000us       0.000us             1
    struct torch::autograd::AccumulateGrad         4.97%     459.025ms         4.97%     459.025ms     459.025ms     459.040ms         6.37%     459.040ms     459.040ms             1
                                aten::add_         4.96%     457.739ms         4.96%     457.739ms     457.739ms     457.760ms         6.35%     457.760ms     457.760ms             1
    struct torch::autograd::AccumulateGrad         4.85%     447.087ms         4.85%     447.087ms     447.087ms     447.104ms         6.20%     447.104ms     447.104ms             1
                                aten::add_         4.83%     445.955ms         4.83%     445.955ms     445.955ms     445.968ms         6.19%     445.968ms     445.968ms             1
    struct torch::autograd::AccumulateGrad         4.71%     434.607ms         4.71%     434.607ms     434.607ms     434.624ms         6.03%     434.624ms     434.624ms             1
------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 9.227s
CUDA time total: 7.210s