The basic view of autograd is that Parameters are held as leaf variables and can be combined in lots of wonderful and weird ways. We call backward() on a result and the gradients are computed all the way back to every leaf variable.
However, let’s say that I want to use a function of some parameters, f(\theta), on each pass of my minibatch. f() is expensive, and f(\theta) won’t change until I step the optimiser. So I’d like to think that I could say y = f(\theta) after zero_grad and before all passes, and then use y within the minibatch loop. The idea is to accumulate gradients on y in the minibatch and not on the leaf variable, \theta.
Is autograd smart enough to see that y is constant in the minibatch loop, or do I need two optimisers: one over what changes in the minibatch and one to optimise f(\theta)?
Where do I look to learn more?
Example code below, just to illustrate what I’m talking about:
import torch

ninp = 8
nout = 16
nmini = 8

theta = torch.nn.Parameter(torch.zeros(ninp))
f = torch.nn.Linear(ninp, nout)
# any optimiser will do; SGD as a stand-in
optimiser = torch.optim.SGD([theta, *f.parameters()], lr=1e-2)

while True:
    optimiser.zero_grad()
    y = f(theta)  # expensive; constant until the optimiser steps
    vec = torch.ones(nout)
    for _ in range(nmini):  # the minibatch loop
        vec = y * vec
        loss = vec.sum()
        # retain_graph=True because the graph through y (and the earlier
        # multiplications) is traversed again by the next backward()
        loss.backward(retain_graph=True)
    optimiser.step()
y is not constant, as you are re-assigning the result of y * vec to it, and the multiplication is differentiable. Autograd will thus create a computation graph with these multiplications.
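For example, sticking with the shapes from your code, the original re-assignment makes y an interior node of the graph rather than a constant:

import torch

theta = torch.nn.Parameter(torch.zeros(8))
f = torch.nn.Linear(8, 16)

y = f(theta)
vec = torch.ones(16)
y = y * vec        # the re-assignment from the original example
print(y.is_leaf)   # False: y is now an interior node of the graph
print(y.grad_fn)   # <MulBackward0 ...>: the multiplication was recorded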
Thanks @ptrblck, you are right: I’d screwed up the example that was supposed to illustrate the problem. I’ve now edited the code to swap vec and y in the loop, so y is a constant within the loop. I’ve also edited the post to start with the basic view of autograd as an introduction to the problem.
The basic view is as you say: y is a function of my parameters and so isn’t a constant. However, the question is specifically about a mini-batch setting, that is, I’m computing y outside the minibatch loop and then iterating over my inputs/computation within it. In this setting, y is constant within the mini-batch loop, and there is a computational gain to be made by accumulating the gradients at y and backpropagating to \theta only once.
I’ve only used the mini-batch scenario for clarity: it shows that backward() is called many times, yet there is a point in the computation graph where gradients could be accumulated and a single call of backward() made from there on. Maybe it would have been clearer if I had said:
y = f(theta)
g(y).backward()
h(y).backward()
where I’d like to backpropagate through f(theta) only once. Then again, this might have ended up in another (related) discussion about backpropagating through the graph twice…
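To spell out that aside: the naive two-backward version only runs at all with retain_graph=True, and it does exactly what I want to avoid, traversing f’s subgraph twice. Here g and h are just stand-in losses, with theta and f as in the example above:

# naive version: both backward calls traverse the graph through f
g = lambda t: t.sum()         # stand-ins for the g and h above
h = lambda t: (t * t).sum()

y = f(theta)
g(y).backward(retain_graph=True)  # retain_graph, or the second backward fails
h(y).backward()                   # traverses f's subgraph a second time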
Of course, I’m working on it as I type. Now I’ve got this far, I’m pretty sure that autograd cannot make this optimisation: it can only fire when backward() is called, so there is no ‘end of loop’ call for it to complete on. So what I’m looking for is a way to introduce a manual graph break, so that y looks like a Parameter within the loop, and then make one more backward call using the gradients accumulated there. Something like:
y = f(theta)                           # graph through f, built once
y_local = y.detach().requires_grad_()  # graph break: a leaf within the loop
for _ in range(...):
    # use y_local; each loss.backward() accumulates into y_local.grad only
    loss.backward()
y.backward(y_local.grad)               # one backward through f into theta
# some optimiser call, now that theta.grad and f.weight.grad are populated
This seems like something we’d want to do quite a lot in big models; I’m just missing the right search terms to find the answer.
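For what it’s worth, here is the sketch fleshed out against my toy example, in case it helps anyone searching later. y_local is just my name for the detached copy, and retain_graph=True is only needed because vec chains across passes of the inner loop:

import torch

ninp, nout, nmini = 8, 16, 8

theta = torch.nn.Parameter(torch.zeros(ninp))
f = torch.nn.Linear(ninp, nout)
optimiser = torch.optim.SGD([theta, *f.parameters()], lr=1e-2)

optimiser.zero_grad()
y = f(theta)                           # the expensive forward, built once
y_local = y.detach().requires_grad_()  # graph break: a leaf inside the loop

vec = torch.ones(nout)
for _ in range(nmini):
    vec = y_local * vec
    loss = vec.sum()
    # gradients stop and accumulate at y_local; nothing flows through f yet
    loss.backward(retain_graph=True)   # vec's graph is reused on the next pass

y.backward(y_local.grad)               # single backward through f into theta
optimiser.step()

The single y.backward(y_local.grad) feeds the accumulated upstream gradient through f’s subgraph exactly once per optimiser step, which is the saving I was after.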