What's the best way to call backward() several times?

I’m writing code (specifically an implementation of Hamiltonian Monte Carlo) that calls backward() on a function several times at each iteration. Profiling this code shows that almost the entire run time is spent in these repeated calls to backward().

As the function itself never changes, is there a way to store the gradient itself as a function in order to reduce the number of calls to backward(), or is there a way to speed up these repeated calls to backward()?

A very minimal example of the code I’m writing is shown below:

import torch
import torch.autograd as autograd
from torch.autograd import Variable

y = Variable(torch.Tensor([0.1, 0.1]), requires_grad = True)

def myenergy(q):
    sigmainv = Variable(torch.Tensor([[10.25, -9.74], [-9.74, 10.25]]))
    #corresponds approximately to rho = 0.95
    return 0.5 * q.matmul(sigmainv).dot(q) 

def HMC_basic(pos, energy, T = 10000, n_steps = 10, stepsize = 0.25):
    #not quite a correct implementation of HMC
    for t in range(T):
        vel = Variable(torch.randn(pos.size()))
        
        for i in range(n_steps):
            pos.data.add_(stepsize*vel.data)
            if pos.grad is not None:
                pos.grad.data.zero_()
            if i != n_steps - 1:
                energy(pos).backward()
                vel -= stepsize * pos.grad
    
    return pos

#with torch.autograd.profiler.profile() as prof:
out = HMC_basic(y, myenergy)
    
#print(prof.key_averages())

Running this code on the CPU and profiling with cProfile and snakeviz tells me that the calls to backward() account for roughly 15 s of the code's ~20 s total run time.
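
For completeness, the autograd profiler hinted at by the commented-out lines in the snippet above can give a per-operator breakdown of where that time goes; a minimal sketch using the same entry points:

with torch.autograd.profiler.profile() as prof:
    out = HMC_basic(y, myenergy, T=100)  # fewer iterations, just to keep the profile quick
print(prof.key_averages())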

In this example, calculating the gradient explicitly is simple, but in most cases it’s not.
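
For this toy energy the explicit gradient is easy to write down: since sigmainv is symmetric, the gradient of 0.5 * q^T * sigmainv * q with respect to q is simply sigmainv * q. A minimal sketch, reusing the matrix from the example above:

sigmainv = Variable(torch.Tensor([[10.25, -9.74], [-9.74, 10.25]]))

def myenergy_grad(q):
    # gradient of 0.5 * q^T A q is A q when A is symmetric
    return q.matmul(sigmainv)

For this particular case that removes the call to backward() entirely, but it obviously doesn't generalize to energies whose gradient can't be written out by hand.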

If you know the gradient doesn’t change, you could compute it once, store it in a variable, and reuse it elsewhere:

energy(pos).backward()   # backward() returns None; the gradient lands in pos.grad
grad = pos.grad.clone()  # copy it once
vel -= stepsize * grad   # reuse the stored gradient without calling backward() again
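
If the goal is to get the gradient back as a value rather than reading it off pos.grad, torch.autograd.grad does that directly and doesn't require zeroing .grad between calls. A minimal sketch, using the same energy and pos names as inside the loop above (autograd is already imported in the snippet):

g, = autograd.grad(energy(pos), pos)  # returns a tuple with one gradient per input
vel -= stepsize * g

Note that this still runs a full forward and backward pass per call, so it mainly tidies the bookkeeping rather than removing the cost.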

If you wanted to memoize the calls to backward, one thing you could do is write a function that computes the gradient you want, something like:

prev_computations = {}  # maps a hashable key for pos -> gradient
def energy_backwards(pos):
    # Variables aren't hashable by value, so key on the underlying numbers
    key = tuple(pos.data.tolist())
    if key in prev_computations:
        return prev_computations[key]
    #  Otherwise, compute the gradient analytically
    ...
    prev_computations[key] = grad
    return grad
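
A runnable variant of that sketch, with two assumptions on my part: the gradient is filled in with autograd.grad rather than an analytic formula, and the cache key is a tuple of the position's values:

import torch.autograd as autograd

grad_cache = {}  # maps tuple of position values -> gradient

def energy_grad_memoized(energy, pos):
    key = tuple(pos.data.tolist())
    if key in grad_cache:
        return grad_cache[key]
    g, = autograd.grad(energy(pos), pos)  # same work as a backward() call, but returns the gradient
    grad_cache[key] = g
    return g

This only pays off if the same position actually recurs, e.g. after a rejected Metropolis-Hastings proposal.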

Thanks for the quick response.

With regard to your first suggestion, I’m a bit confused, as my pos variable changes at every time step and at every leapfrog update, so the gradient of the energy with respect to pos changes as well. Reading pos.grad after calling backward() only once doesn’t give me an updated gradient at every step.

Memoizing the gradients should help speed up the gradient calculation when I’m using an MH step in HMC, and I can try that. However, in this specific example it may not help, as my position should never be exactly the same twice (I’m effectively drawing random samples from the probability distribution defined by the energy).
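
For what it's worth, when a hand-written gradient like the myenergy_grad sketch above is available, the inner leapfrog loop can skip autograd entirely; a rough sketch (toy case only, ignoring the skipped final velocity update from the original code):

for i in range(n_steps):
    pos.data.add_(stepsize * vel.data)
    vel.data.add_(-stepsize * myenergy_grad(pos).data)  # no backward() call needed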