How does autograd average across a minibatch?

miturian · September 11, 2021, 8:44pm

I apologize if this is adequately explained in the documentation. If so, please just give me a link…

I have only found

https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html

neither of which seems to explain this.

However, my question is:
The gradient in a given portion of the computational graph depends not only on the loss and the parameter values, but also on the input. If this is true, then even though the loss is summed over the minibatch outputs before backpropagation, how does the optimizer calculate the average gradient? Are all batch_size number of activations for each neuron saved during the forward step, or how?

I hope the question makes sense.

KFrank · September 11, 2021, 9:27pm

Hi Miturian!

It depends entirely on how the “loss” on which you call loss.backward()
is calculated.

Autograd simply computes the gradient of whatever scalar you called
.backward() on.

Most pytorch loss functions calculate the average across the minibatch
of the per-sample losses, but they often give you the option to
compute the 'sum' or apply sample weightings.

You could, for example, have a minibatch of ten samples, but only
compute the loss for the first sample, ignoring the other nine. If you
did that, you would only get the gradient of that first sample.
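
A minimal sketch of these cases, assuming a made-up toy setup (a single Linear layer, random data, and MSE as the loss; none of this comes from your actual model). It compares the 'mean' and 'sum' reductions and a loss computed on the first sample only:

```python
import torch

torch.manual_seed(0)

# Hypothetical toy setup: 10 samples, 3 features, one linear layer.
model = torch.nn.Linear(3, 1)
inputs = torch.randn(10, 3)
targets = torch.randn(10, 1)

def weight_grad(loss_fn):
    """Run a fresh forward pass, backpropagate, and return the weight gradient."""
    model.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    return model.weight.grad.clone()

# Same per-sample losses, reduced in two different ways.
grad_mean = weight_grad(torch.nn.MSELoss(reduction="mean"))
grad_sum = weight_grad(torch.nn.MSELoss(reduction="sum"))

# Loss built from the first sample only; the other nine are ignored.
grad_first = weight_grad(
    lambda out, tgt: torch.nn.functional.mse_loss(out[:1], tgt[:1])
)

# Hand-computed gradient of (pred_0 - target_0)^2 with respect to the weight.
with torch.no_grad():
    residual = model(inputs[:1]) - targets[:1]
    expected_first = 2 * residual * inputs[:1]

print(grad_mean, grad_sum, grad_first, expected_first, sep="\n")
```

With ten scalar outputs, the 'sum' gradient should come out as ten times the 'mean' gradient, and the first-sample gradient should match the hand-computed one.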

And, just to be clear, the optimizer does not calculate the gradients,
averaged or otherwise. By the time you call opt.step() the gradients
have already been calculated (when you called loss.backward()).
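
To make that division of labor concrete, here is a small sketch assuming plain SGD with a learning rate of 0.1 and a made-up Linear model (both are illustrative choices). The gradient shows up in .grad right after loss.backward(), and opt.step() only consumes it:

```python
import torch

torch.manual_seed(0)

# Hypothetical toy model, data, and optimizer.
model = torch.nn.Linear(3, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
inputs = torch.randn(10, 3)
targets = torch.randn(10, 1)

loss = torch.nn.functional.mse_loss(model(inputs), targets)

grad_before_backward = model.weight.grad   # nothing has been computed yet
loss.backward()                            # autograd fills in .grad here
grad = model.weight.grad.clone()
weight_before = model.weight.detach().clone()

opt.step()                                 # plain SGD: w <- w - lr * w.grad
weight_after = model.weight.detach().clone()

print(grad_before_backward)
print(weight_after - (weight_before - 0.1 * grad))
```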

Best.

K. Frank

miturian · September 11, 2021, 9:33pm

thanks for replying

ok, so, assume that my loss is the average over the individual losses in the minibatch. How does .backward() turn that into an average gradient? If my network was simply the operation “1/x”, with 10 different values of x in my minibatch, then presumably the gradient would have 10 different values, even though I was backpropagating an average loss?

KFrank · September 12, 2021, 4:31am

Hi Miturian!

First some contextual comments:

Pytorch’s autograd operates on tensor computations that produce
a scalar. (Autograd can manage things slightly more general than
just a scalar result, but let’s leave that aside for this discussion.)

Leaf tensors are those that are inputs to the overall computation,
and can have their requires_grad property set to True or False.
Autograd computes the gradient of the final scalar result with respect
to (the elements of) the requires_grad = True leaf tensors.

Autograd neither knows nor cares that your computation might be
implemented as a pytorch Module that we conceptually understand
as a model, nor that one of its requires_grad = False leaf tensors
is, in the typical use case, the input to your model and represents a
batch of samples. It neither knows nor cares that the weights (and
other Parameters) of your model are its requires_grad = True
leaf tensors with respect to which it will calculate gradients.

This structure of applying a model with weights to a batch of samples
and then computing a scalar loss function for the output of the model
is higher-level structure that is distinct from autograd’s lower-level
functionality of computing gradients of tensor operations that may
or may not fit that higher-level structure.

(Autograd also neither knows nor cares that the gradients it computes
will typically be used by an optimizer to train the weights of a model.)

Having gotten all of that out of the way …

At a mathematical level, the average of the gradients is the gradient
of the average. That is, taking the gradient is a linear operation.
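
A quick numerical check of that linearity, as a sketch with a made-up scalar weight w and four samples (the per-sample loss 1 / (w * x_i) is just an illustrative choice):

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0, 4.0])

def per_sample_loss(x_i):
    # Illustrative per-sample loss; returns a scalar tensor.
    return (1.0 / (w * x_i)).sum()

# Gradient of the average loss.
avg_loss = torch.stack([per_sample_loss(x_i) for x_i in x]).mean()
(grad_of_avg,) = torch.autograd.grad(avg_loss, w)

# Average of the per-sample gradients, each computed separately.
per_sample_grads = [torch.autograd.grad(per_sample_loss(x_i), w)[0] for x_i in x]
avg_of_grads = torch.stack(per_sample_grads).mean(dim=0)

print(grad_of_avg, avg_of_grads)
```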

But the more computationally mechanistic answer is that the average
is computed by summing over a number of items and then multiplying
that sum by one over the number of items. Autograd tracks the
“computation graph” of these operations and applies the chain rule to
them numerically. The gradient (in this case, the simple derivative) of
constant times x (with respect to x) is just the constant. Upstream of
that in the computation graph is that the gradient of sum of x_i (with
respect to the x_i) is just one for each of the x_i.

This example doesn’t really have the structure of a model with
weights applied to a batch of input samples. (In your case, it sounds
like you want the gradient with respect to the “input minibatch.” That’s
fine, but it’s not the typical use case.)

So let me phrase it a bit differently:

I have a 1d tensor, T, of shape [10] that consists of ten different values
of x (that I will consider a minibatch). I have a tensor operation that
computes the (scalar) average of 1 / x for that set of ten xs. We won’t
call this a “model” because it doesn’t really fit the typical model use case.
This scalar-valued tensor operation has a single leaf tensor, T, and it
has requires_grad = True. Let’s use autograd to compute the gradient
of the scalar result with respect to T:

>>> import torch
>>> torch.__version__
'1.9.0'
>>> T = torch.arange (10.) + 1.
>>> T.requires_grad = True
>>> T
tensor([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.], requires_grad=True)
>>> scalar_result = (1 / T).mean()
>>> scalar_result
tensor(0.2929, grad_fn=<MeanBackward0>)
>>> scalar_result.backward()
>>> T.grad
tensor([-0.1000, -0.0250, -0.0111, -0.0063, -0.0040, -0.0028, -0.0020, -0.0016,
        -0.0012, -0.0010])
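
As a sanity check on those numbers, the chain rule gives d/dx_i of mean(1 / x) as -1 / (N * x_i^2), with N = 10 here. A short sketch comparing that formula against what autograd returns:

```python
import torch

T = torch.arange(10.0) + 1.0
T.requires_grad = True

(1 / T).mean().backward()

# Chain rule by hand: d(mean)/d(1/x_i) = 1/N, and d(1/x_i)/dx_i = -1/x_i^2.
N = T.numel()
analytic = -1.0 / (N * T.detach() ** 2)

print(T.grad)
print(analytic)
```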

The “magic” happens because the pytorch developers have implemented
a companion .backward() function for each (differentiable) pytorch
tensor operation that computes its gradient. Autograd tracks how
a series of such tensor operations are composed together in its
“computation graph” and uses the chain rule to link together the
individual .backward() functions to compute the gradient of the
final scalar result with respect to the requires_grad = True leaf
tensors.
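
To give a feel for what such a companion function looks like, here is a sketch of a hand-written one using torch.autograd.Function. The class name Reciprocal is made up for illustration, and this is not how pytorch implements 1 / x internally; it only mimics the forward / backward pairing:

```python
import torch

class Reciprocal(torch.autograd.Function):
    """Illustrative re-implementation of 1 / x with an explicit backward."""

    @staticmethod
    def forward(ctx, x):
        y = 1.0 / x
        ctx.save_for_backward(y)   # context the backward pass will need
        return y

    @staticmethod
    def backward(ctx, grad_output):
        (y,) = ctx.saved_tensors
        # d(1/x)/dx = -1/x^2 = -y^2, multiplied by the incoming gradient.
        return -grad_output * y * y

T = torch.arange(10.0) + 1.0
T.requires_grad = True

Reciprocal.apply(T).mean().backward()
print(T.grad)
```

The result should agree with the T.grad obtained from the built-in 1 / T above.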

Best.

K. Frank

miturian · September 12, 2021, 3:38pm

Dear KFrank

Thank you very much for your detailed answers. I think I am getting close to understanding it, but will ask one more question, to which the answer is hopefully ‘yes, that’s it’. If so, I’ll update this thread with a description of my own misunderstanding, so perhaps others making the same mistake can get some extra value from this thread.

Anyway:

I agree that my ‘1/x’ example was poorly chosen. I was simply grasping for a function with a non-constant gradient.

Let’s assume a very simple network:

y=x*w

with the loss being avr(y)

Let’s assume w=2, x=[1,2] (we have a batchSize of 2, input is a scalar)

In the forward pass, we get y=[2,4], L=.5*(2+4)=3

By the chain rule:

dL/dw=dL/dy0*dy0/dw+dL/dy1*dy1/dw

We see that there is a “batchSize” number of terms in this sum. As I understand you, all these factors are stored in the ‘graph’ until .backward() is called, is that correct? I was assuming that there would be averaging or something else going on as each layer was called in the network, but that was incorrect? Since autograd doesn’t know anything about mini batches, does that mean that x=1 and x=2 create two almost equal ‘arms’ in the graph, with a common root at L? If that picture makes sense.

So, the graph contains ALL computations that took place, even if a lot of them are linked to the same variables.

As in, I was expecting the graph to somehow make a rewriting like:

dL/dy0*dy0/dw+dL/dy1*dy1/dw=dL/dy0*(dy0/dw+dy1/dw)

to take advantage of symmetries, but from your ‘neither knows nor cares’ phrasing, I assume that autograd does no such thing, and instead keeps track of everything?

Is there a way to inspect this graph somehow? Or at least see its size? I am supposed to be including some of this in a lecture, and I think some way of visualizing how much is actually being saved in the graph would be instructive.

Regards

KFrank · September 13, 2021, 2:58am

Hi Miturian!

I think you’re trying to say something slightly different, so let me
rephrase it:

x is a batchSize-2 “input” to the “model”; w is the (scalar) weight
that defines this very simple “model”.

Yes.

Yes, but I think it’s more helpful to think of this as a “tensor” chain
rule, rather than a chain rule broken down into scalar pieces.

Thus:

dL / dw = sum_i { dL / dy_i * dy_i / dw },

where sum_i {... * ...} is a tensor contraction.

That is:

>>> import torch
>>> torch.__version__
'1.9.0'
>>>
>>> w = torch.tensor ([2.], requires_grad = True)  # "requires_grad = True" leaf tensor
>>> x = torch.tensor ([1., 2.])                    # "requires_grad = False" leaf tensor
>>> y = torch.mul (x, w)                           # non-leaf tensor (same as x * w)
>>> L = y.mean()                                   # root (non-leaf) final result
>>>
>>> y.retain_grad()                                # so we can examine y.grad
>>> L.retain_grad()                                # so we can examine L.grad
>>>
>>> w
tensor([2.], requires_grad=True)
>>> x
tensor([1., 2.])
>>> y
tensor([2., 4.], grad_fn=<MulBackward0>)
>>> L
tensor(3., grad_fn=<MeanBackward0>)
>>>
>>> L.backward (retain_graph = True)
>>>
>>> L.grad   # unit "seed" to start chain rule
tensor(1.)
>>> y.grad   # [dL / dy_0, dL / dy_1]
tensor([0.5000, 0.5000])
>>> x.grad   # None, because x is a leaf with requires_grad = False
>>> w.grad   # [dL / dw] = sum_i {dL / dy_i * dy_i / dw}
tensor([1.5000])
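
The contraction can also be carried out by hand and compared with w.grad. A sketch, using the fact that dy_i / dw = x_i for y = x * w:

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
x = torch.tensor([1.0, 2.0])
y = x * w
L = y.mean()

# dL / dy_i, asked for directly (keep the graph so backward() can run after).
(dL_dy,) = torch.autograd.grad(L, y, retain_graph=True)

# dy_i / dw = x_i, written down by hand for this particular operation.
dy_dw = x

# Tensor contraction over the batch index i.
manual = (dL_dy * dy_dw).sum()

L.backward()
print(manual, w.grad)
```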

If I understand you correctly that “these factors” are things like dy_i / dw,
then, no, they are not stored in the graph, waiting for .backward() to
be called.

Instead, when y = x * w is called (in the “forward pass”), only the fact
that * (torch.mul()) was called is stored – by using a MulBackward0
object – together with any torch.mul()-specific context that mul()'s
companion backward() function will need to calculate dy_i / dw at
the time that the backward pass is run. To emphasize, dy_i / dw is
neither computed nor stored during the forward pass (but information
sufficient to compute it during the backward pass is stored).
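
One way to peek at what gets stored is the saved_tensors_hooks context manager. This is a sketch that assumes a newer pytorch than the 1.9.0 used above (as far as I know the hooks were added in 1.10), and exactly which tensors get saved is an implementation detail that may differ between versions:

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
x = torch.tensor([1.0, 2.0])

saved_shapes = []

def pack_hook(t):
    # Called whenever autograd saves a tensor for the backward pass.
    saved_shapes.append(tuple(t.shape))
    return t

def unpack_hook(t):
    # Called when the backward pass retrieves the saved tensor.
    return t

with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
    y = x * w
    L = y.mean()

node_name = type(y.grad_fn).__name__
L.backward()

print(node_name, saved_shapes, w.grad)
```

The recorded shapes are those of whole tensors (here the batch x), not of per-sample derivatives such as dy_i / dw.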

I prefer to say that y = x * w creates a single arm, but that the arm in
question is a “tensor arm.” This is not purely semantic, in that autograd
stores a single “tensor arm” for this operation, rather than two separate
“scalar arms.”

Just to be clear, the graph only contains those computations that
lead back to a requires_grad = True leaf tensor. It doesn’t
waste time or storage on computations that lead back solely to
requires_grad = False leaf tensors.
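
A small sketch of that distinction, using the same x and w as above: a result computed only from requires_grad = False tensors carries no grad_fn at all, while one that involves w does.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
x = torch.tensor([1.0, 2.0])

untracked = (x * 3.0).sum()   # depends only on a requires_grad = False leaf
tracked = (x * w).sum()       # depends on w, so it is recorded in the graph

print(untracked.requires_grad, untracked.grad_fn)
print(tracked.requires_grad, tracked.grad_fn)
```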

If by this you mean that a lot of scalar computations are “linked to
the same variables” because those scalar computations are part
of the same tensor computation, the graph, in a sense, contains
all of those scalar computations, but in an efficient way, because
those scalar computations are packaged together as a single tensor
operation.

If by “symmetries” you mean that the several (in this example, two)
scalar operations are related to one another because they are part
of the same tensor operation, then, yes, autograd takes advantage
of this structure – and does so, as described above, by storing these
several scalar operations together as a single tensor operation.

I don’t know how to, but I think I’ve seen posts that talk about probing
the details of the graph. (I would have to imagine that the pytorch api
supports this, even if it isn’t well-documented or a part of what we
think of as the “public-facing” api, but, again, I don’t know how to do it.)
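
One possible starting point, offered as a sketch rather than a definitive answer: every non-leaf tensor exposes a grad_fn node, and each node lists its upstream nodes in next_functions. The helper name walk below is made up for illustration, and the node class names it prints are internal details that may vary between versions.

```python
import torch

def walk(fn, depth=0, lines=None):
    """Collect the class names of the backward nodes reachable from fn."""
    lines = [] if lines is None else lines
    if fn is None:   # inputs that do not require grad show up as None
        return lines
    lines.append("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1, lines)
    return lines

w = torch.tensor([2.0], requires_grad=True)
x = torch.tensor([1.0, 2.0])
L = (x * w).mean()

graph_lines = walk(L.grad_fn)
print("\n".join(graph_lines))
```

The number of nodes depends on how many tensor operations were recorded, not on the batch size, so a larger batch gives the same graph with larger tensors attached to it.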

Best.

K. Frank

miturian · September 13, 2021, 6:48am

Thank you, you have done a heroic job of answering