How to reduce autograd memory usage?

Are you using the nightly build by any chance? If not, can you try again with the nightly build? We recently fixed a bug with notebooks that was preventing the stack trace from showing up there.
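(For reference, the stack trace in question is the one anomaly detection prints. A minimal sketch of turning it on, with a toy model standing in for the real one:)

import torch

# Anomaly mode records, for every autograd node, the stack of the forward
# call that created it, so an error raised during backward() points back
# to the offending line of the forward pass.
torch.autograd.set_detect_anomaly(True)

model = torch.nn.Linear(4, 1)    # toy model, for illustration only
loss = model(torch.randn(2, 4)).sum()
loss.backward()                  # any backward error now carries the forward-call traceback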

Otherwise, if this error wasn’t happening without the checkpoint, you can double check that you don’t have any state in the model that changes between the first evaluation and the second.
If you leave only one checkpoint in your net, you can make sure that the two evaluations (one during the forward and one during the backward) both return exactly the same thing.
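A rough sketch of that check, with a toy function and linear layer standing in for the real checkpointed segment (shapes here are made up):

import torch
from torch.utils import checkpoint

lin = torch.nn.Linear(4, 4)      # stand-in for the wrapped part of the model

def fn(x):
    return lin(x).relu()

x = torch.randn(2, 4, requires_grad=True)

# The checkpoint re-runs fn during the backward pass, so these two
# evaluations must return exactly the same thing for the gradient to be valid.
with torch.no_grad():
    first, second = fn(x), fn(x)
print(torch.equal(first, second))    # should be True if fn carries no changing state

out = checkpoint.checkpoint(fn, x)   # same call pattern as later in the thread
out.sum().backward()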

I wasn’t using the nightly version, so I switched to it, but it still wouldn’t give me the traceback. I finally tried running it from the Anaconda prompt, and it gave me this:

[W ..\torch\csrc\autograd\python_anomaly_mode.cpp:60] Warning: Error detected in CheckpointFunctionBackward. Traceback of forward call that caused the error:
  File "estimations_new2.py", line 200, in <module>
    take_step=take_step, minimizer_kwargs=arg)
  File "C:\Users\Peter\Anaconda3\lib\site-packages\scipy\optimize\_basinhopping.py", line 679, in basinhopping
    accept_tests, disp=disp)
  File "C:\Users\Peter\Anaconda3\lib\site-packages\scipy\optimize\_basinhopping.py", line 72, in __init__
    minres = minimizer(self.x)
  File "C:\Users\Peter\Anaconda3\lib\site-packages\scipy\optimize\_basinhopping.py", line 284, in __call__
    return self.minimizer(self.func, x0, **self.kwargs)
  File "C:\Users\Peter\Anaconda3\lib\site-packages\scipy\optimize\_minimize.py", line 626, in minimize
    constraints, callback=callback, **options)
  File "C:\Users\Peter\Anaconda3\lib\site-packages\scipy\optimize\slsqp.py", line 370, in _minimize_slsqp
    bounds=new_bounds)
  File "C:\Users\Peter\Anaconda3\lib\site-packages\scipy\optimize\optimize.py", line 262, in _prepare_scalar_function
    finite_diff_rel_step, bounds, epsilon=epsilon)
  File "C:\Users\Peter\Anaconda3\lib\site-packages\scipy\optimize\_differentiable_functions.py", line 76, in __init__
    self._update_fun()
  File "C:\Users\Peter\Anaconda3\lib\site-packages\scipy\optimize\_differentiable_functions.py", line 166, in _update_fun
    self._update_fun_impl()
  File "C:\Users\Peter\Anaconda3\lib\site-packages\scipy\optimize\_differentiable_functions.py", line 73, in update_fun
    self.f = fun_wrapped(self.x)
  File "C:\Users\Peter\Anaconda3\lib\site-packages\scipy\optimize\_differentiable_functions.py", line 70, in fun_wrapped
    return fun(x, *args)
  File "C:\Users\Peter\Anaconda3\lib\site-packages\scipy\optimize\optimize.py", line 74, in __call__
    self._compute_if_needed(x, *args)
  File "C:\Users\Peter\Anaconda3\lib\site-packages\scipy\optimize\optimize.py", line 68, in _compute_if_needed
    fg = self.fun(x, *args)
  File "C:\Users\Peter\Desktop\JMP\Analysis\Original_estimation_3-22-19\myLib_new2.py", line 115, in fun
    ll = ll + logL(x, theta, model, data[d])
  File "C:\Users\Peter\Desktop\JMP\Analysis\Original_estimation_3-22-19\myLib_new2.py", line 54, in logL
    out = checkpoint.checkpoint(fn, x)
  File "C:\Users\Peter\Anaconda3\lib\site-packages\torch\utils\checkpoint.py", line 163, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
 (function print_stack)

I also discovered that I don’t encounter the error if I run my estimation on only one of the two data sets I’m using, which makes the for loop indexed by d in my original post run once instead of twice.

Ah, so the issue most likely happens inside the second forward pass done by the checkpoint.
Can you add some prints in there to make sure you did not modify any state from one iteration of the loop to the next?
Keep in mind that any list/dict captured in fn will reflect in-place modifications made after the first evaluation, as in the sketch below.
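A minimal sketch of that pitfall, with toy tensors rather than the real model: the checkpointed function reads a dict that is modified between the forward pass and the backward pass, so the recomputation no longer matches the original evaluation.

import torch
from torch.utils import checkpoint

state = {"scale": torch.ones(3)}       # captured by fn instead of being passed in

def fn(x):
    return x * state["scale"]          # reads mutable external state

x = torch.randn(3, requires_grad=True)
out = checkpoint.checkpoint(fn, x)     # first evaluation sees a (3,)-shaped scale

state["scale"] = torch.ones(5)         # the captured dict is changed afterwards

# The backward pass re-runs fn, which now sees a (5,)-shaped scale that no
# longer broadcasts against x -> a shape/broadcasting error during backward.
out.sum().backward()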

Ok, so I was able to determine that the error comes from the variables S and K changing across iterations of the for loop. These variables are integers that determine the dimensions of several tensors, hence the broadcasting error. I set them to the same values across iterations and the program now runs to completion. The only problem is that the gradient is incorrectly evaluated to always be a vector of 0’s. I’m thinking of restructuring the program so that I don’t need this for loop in the first place, since I now realize there are a host of other variables that could potentially change across iterations. Hopefully that will fix the issue with the gradient, but even if it doesn’t, it’s probably how I should have structured the program in the first place. Once the program is working properly, I will finally be able to compare its memory consumption with and without the checkpoints.
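One possible way to make each checkpoint call self-contained, sketched with made-up names and toy data rather than the actual estimation code: pass everything that changes across iterations (for example, the per-dataset tensors whose sizes S and K determine) to the checkpointed function as explicit arguments instead of capturing it in the closure, so the recomputation during backward sees the same values the original forward saw.

import torch
from torch.utils import checkpoint

theta = torch.randn(4, requires_grad=True)

def log_likelihood(theta, data_d):
    # Everything this function needs arrives through its arguments, so the
    # recomputation during backward uses the same per-iteration tensors.
    return (data_d @ theta).logsumexp(dim=0)

datasets = [torch.randn(3, 4), torch.randn(5, 4)]   # different sizes per dataset

ll = 0.0
for data_d in datasets:
    ll = ll + checkpoint.checkpoint(log_likelihood, theta, data_d)
ll.backward()
print(theta.grad)    # non-zero: the graph through each checkpoint stays intact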


After doing some more testing, I learned that the error

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

only appears when the model attributes arg, p_ak, U, S, and S_q are not manually detached. If I detach nearly any of these attributes, however, the returned gradient is a vector of zeros; when they are not detached, the gradient is calculated correctly. Any idea why this might be happening? I’m really at a loss for what to do next. Thank you again for the help you’ve given me up to this point.
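(A toy illustration of the trade-off between the two symptoms, with made-up names rather than the actual model attributes:)

import torch

x = torch.randn(3, requires_grad=True)   # stands in for the optimization variable
w = torch.randn(3, requires_grad=True)   # stands in for some other model parameter

# A "model attribute" computed from x once and then reused across evaluations.
attr = x.exp()

# Without detaching, the second backward through the cached attribute hits the
# error quoted above, because the first backward freed its graph's buffers.
(attr * w).sum().backward()
try:
    (attr * w).sum().backward()
except RuntimeError as err:
    print(err)                           # Trying to backward through the graph a second time ...

# Detaching attr avoids that error, but it also cuts x out of the graph, so the
# gradient with respect to x disappears (the vector-of-zeros symptom).
x.grad, w.grad = None, None
(attr.detach() * w).sum().backward()
print(x.grad)                            # None: x no longer participates in the graph
print(w.grad)                            # exp(x), now treated as a constant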

Maybe it would be simpler if you could make a small repro that uses your network and random data, which I can run locally on my machine?

I PM’d you a modified version of the code to take a look at.