Diagnosing an incredibly slow backwards pass

I have a model that computes a time-varying covariance matrix for time-series data. The data has about 40 features per time point, with a few hundred thousand time points. The forward pass and loss calculation are quite quick, ~1 or 2 seconds, but the backwards pass takes upwards of 5 minutes for each minibatch.

I’ve tried using the profiler to diagnose the issue, but it doesn’t seem to be overly helpful. I’ve uploaded the exported profile to Dropbox here.

I’m struggling to glean any useful information from this profile, could someone take a look at it and see if I’m missing anything, or whether it indicates something I’m doing fundamentally incorrectly?

You could try `torch.utils.bottleneck` to profile your model, or have you already tried it?
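For reference, a minimal invocation; `train.py` and its arguments are placeholders, so substitute your own script:

```shell
# Runs the script once under cProfile and once under the autograd
# profiler, then prints an environment summary and both reports.
python -m torch.utils.bottleneck train.py --batch-size 32
```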

Running your script with cProfile
Running your script with the autograd profiler...
  Environment Summary
PyTorch 0.5.0a0+d9aeb7e compiled w/ CUDA 8.0
Running with Python 3.6 and CUDA 8.0.61

`pip list` truncated output:
numpy (1.14.3)
torch (0.5.0a0+d9aeb7e)
torchvision (0.1.9)
  cProfile output
         313970 function calls (309485 primitive calls) in 204.919 seconds

   Ordered by: internal time
   List reduced from 2145 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1  200.084  200.084  200.084  200.084 {method 'run_backward' of 'torch._C._EngineBase' objects}
        7    3.308    0.473    3.309    0.473 /Users/askates/anaconda3/envs/torch_gpu/lib/python3.6/site-packages/h5py/_hl/dataset.py:634(read_direct)
     2176    0.253    0.000    3.564    0.002 {built-in method numpy.core.multiarray.array}
     5000    0.190    0.000    0.190    0.000 {built-in method inverse}
    39/26    0.101    0.003    0.124    0.005 {built-in method _imp.create_dynamic}
      315    0.086    0.000    0.086    0.000 {method 'astype' of 'numpy.ndarray' objects}
        2    0.075    0.038    0.075    0.038 {built-in method stack}
     5000    0.070    0.000    0.070    0.000 {method 'potrf' of 'torch._C._TensorBase' objects}
     2000    0.053    0.000    0.053    0.000 {built-in method addmm}
      160    0.050    0.000    0.050    0.000 {built-in method marshal.loads}
     1000    0.047    0.000    0.149    0.000 /Users/askates/anaconda3/envs/torch_gpu/lib/python3.6/site-packages/torch/nn/_functions/rnn.py:48(GRUCell)
        1    0.038    0.038    0.271    0.271 /Users/askates/Documents/Projects/rvae_pytorch/rvae/svmodel.py:63(forward)
     1178    0.037    0.000    0.037    0.000 {built-in method posix.stat}
    10000    0.036    0.000    0.036    0.000 /Users/askates/anaconda3/envs/torch_gpu/lib/python3.6/site-packages/torch/tensor.py:361(<lambda>)
        1    0.031    0.031    0.304    0.304 /Users/askates/anaconda3/envs/torch_gpu/lib/python3.6/site-packages/torch/distributions/multivariate_normal.py:67(_batch_mahalanobis)

  autograd profiler output (CPU mode)
        top 15 events sorted by cpu_time_total

----------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                     CPU time        CUDA time            Calls        CPU total       CUDA total
----------------  ---------------  ---------------  ---------------  ---------------  ---------------
stack                 40637.858us          0.000us                1      40637.858us          0.000us
stack                 33870.093us          0.000us                1      33870.093us          0.000us
mul                   27187.788us          0.000us                1      27187.788us          0.000us
BmmBackward           25134.970us          0.000us                1      25134.970us          0.000us
cat                   21868.530us          0.000us                1      21868.530us          0.000us
add                   20842.460us          0.000us                1      20842.460us          0.000us
add                   20106.253us          0.000us                1      20106.253us          0.000us
MulBackward1          19402.615us          0.000us                1      19402.615us          0.000us
mul                   19392.639us          0.000us                1      19392.639us          0.000us
cat                   19206.579us          0.000us                1      19206.579us          0.000us
add                   18342.984us          0.000us                1      18342.984us          0.000us
add                   16738.660us          0.000us                1      16738.660us          0.000us
add                   16615.992us          0.000us                1      16615.992us          0.000us
add                   16263.219us          0.000us                1      16263.219us          0.000us
add                   16205.741us          0.000us                1      16205.741us          0.000us

  autograd profiler output (CUDA mode)
        top 15 events sorted by cpu_time_total

	Because the autograd profiler uses the CUDA event API,
	the CUDA time column reports approximately max(cuda_time, cpu_time).
	Please ignore this output if your code does not use CUDA.

---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                    CPU time        CUDA time            Calls        CPU total       CUDA total
---------------  ---------------  ---------------  ---------------  ---------------  ---------------
stack                89857.817us      82362.793us                1      89857.817us      82362.793us
stack                69146.335us      69145.508us                1      69146.335us      69145.508us
CatBackward          59849.753us      59853.027us                1      59849.753us      59853.027us
CatBackward          51944.333us      54468.750us                1      51944.333us      54468.750us
cat                  38664.195us      34831.055us                1      38664.195us      34831.055us
mul                  31829.888us      31864.258us                1      31829.888us      31864.258us
matmul               29713.333us      29652.344us                1      29713.333us      29652.344us
BmmBackward          26958.807us      26968.750us                1      26958.807us      26968.750us
bmm                  25837.496us      24409.668us                1      25837.496us      24409.668us
add                  25265.362us      34644.531us                1      25265.362us      34644.531us
cat                  25188.573us      25207.520us                1      25188.573us      25207.520us
add                  24454.340us      24490.723us                1      24454.340us      24490.723us
add                  22953.946us      22984.375us                1      22953.946us      22984.375us
add                  22696.327us      22718.750us                1      22696.327us      22718.750us
add                  21764.742us      22125.000us                1      21764.742us      22125.000us

I’m not sure if this sheds any light either. Edit: Also, as I am currently testing this on my MacBook, I am not using a GPU for this.

In your training code, do you wrap everything that does not need to be included in the computation graph in `with torch.no_grad()`? I experienced a similar issue, and it was because my evaluation code was being included in the computation graph.
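A minimal sketch of the idea, using a hypothetical stand-in model (a plain `torch.nn.Linear`, not the poster's actual network):

```python
import torch

model = torch.nn.Linear(40, 40)   # stand-in for the real model
x = torch.randn(8, 40)

# Evaluation wrapped in no_grad: no graph is built here, so none of
# this work can leak into (and slow down) the next backward pass.
with torch.no_grad():
    val_out = model(x)

print(val_out.requires_grad)  # False: the output is detached from any graph
```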

All variables that don’t need a gradient have the flag `requires_grad` set to `False`, and I create the optimiser with `torch.optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr)`.

I’ve gone back and looked at how the number of spatial dimensions D affects the time taken for the backwards pass, and it appears to be quadratic, which I suppose makes sense since a covariance matrix is DxD. The question is: does a backwards pass taking 200x longer than the forwards pass make sense?
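One way to sanity-check that scaling is to time `backward()` on a toy covariance-style computation for a few values of D. This is an assumption-laden stand-in (random Cholesky-like factors and a trivial sum loss), not the actual model:

```python
import time
import torch

def backward_time(d, n=1000):
    # Build n DxD covariance matrices from random factors, reduce to a
    # scalar loss, and time only the backward pass.
    L = torch.randn(n, d, d, requires_grad=True)
    cov = L @ L.transpose(1, 2)   # n symmetric DxD matrices
    loss = cov.sum()
    t0 = time.perf_counter()
    loss.backward()
    return time.perf_counter() - t0

for d in (10, 20, 40):
    print(d, backward_time(d))
```

Plotting the times against D (or D squared) should make the quadratic trend, if real, obvious.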

I have uploaded a full copy of the profiler text here: https://www.dropbox.com/s/6vh73ihlclevq9s/Profile.txt?dl=0

The top 10 longest are:

---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                    CPU time        CUDA time            Calls        CPU total       CUDA total
---------------  ---------------  ---------------  ---------------  ---------------  ---------------
add                  25106.022us          0.000us                1      25106.022us          0.000us
add                  22598.556us          0.000us                1      22598.556us          0.000us
BmmBackward          22167.860us          0.000us                1      22167.860us          0.000us
add                  21988.583us          0.000us                1      21988.583us          0.000us
add                  21765.380us          0.000us                1      21765.380us          0.000us
add                  21736.503us          0.000us                1      21736.503us          0.000us
add                  21113.028us          0.000us                1      21113.028us          0.000us
add                  20385.941us          0.000us                1      20385.941us          0.000us
add                  20219.376us          0.000us                1      20219.376us          0.000us
add                  20114.464us          0.000us                1      20114.464us          0.000us

Is there any way of debugging the backwards pass? If I try to do it through PyCharm, for example, I get as far as `Variable._execution_engine.run_backward` before I can no longer step through it; I assume this part is implemented in C++? Is there any way I can inspect the specific operations mentioned in the profiler?
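One option that doesn't require stepping into the C++ engine is a tensor hook: `Tensor.register_hook` fires when the gradient for that tensor is computed during backward, so you can print or inspect gradients at chosen points in the graph. A minimal sketch on a toy tensor:

```python
import torch

x = torch.randn(5, requires_grad=True)
y = (x * 2).sum()

# The hook runs inside the backward pass, when x's gradient is ready;
# this is a way to peek into the otherwise-opaque engine.
x.register_hook(lambda grad: print("grad for x:", grad))
y.backward()
```

Sprinkling hooks (or timing calls inside them) on intermediate tensors of the real model could narrow down which part of the graph dominates the backward time.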

I’m also wondering if the time taken in the backwards pass is simply due to how many operations take place; running the profiler on only the backwards pass records over 300,000 operations. Printing the contents of the profiler alone takes ages…

Could anyone throw out some suggestions? I’ve now got it running on a cluster with GPUs, which is faster but still quite slow: on the order of 1 minute 20 seconds per minibatch, which is around 1 hour per epoch. It would be nice for a run to take less than a week or two!

Happy to post my code if it would help.
