 # How does autograd average across a minibatch?

I apologize if this is adequately explained in the documentation. If so, please just give me a link…

I have only found

neither of which seems to explain this.

However, my question is:
The gradient in a given portion of the computational graph depends not only on the loss and the parameter values, but also on the input. If this is true, then even though the loss is summed over the minibatch outputs before backpropagation, how does the optimizer calculate the average gradient? Are all batch_size number of activations for each neuron saved during the forward step, or how?

I hope the question makes sense.

Hi Miturian!

It depends entirely on how the “loss” on which you call `loss.backward()`
is calculated.

Autograd simply computes the gradient of whatever scalar you call
`.backward()` on.

Most pytorch loss functions calculate the average across the minibatch
of the per-sample losses, but they often give you the option to
compute the `'sum'` or apply sample weightings.
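As a rough illustration of the `'mean'` versus `'sum'` options, here is a minimal sketch. The one-weight "model" `x * w`, the data, and the zero targets are made up for the example; `F.mse_loss` and its `reduction` argument are the standard pytorch ones.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy setup: a single scalar weight applied to a batch of 4 inputs.
w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0, 4.0])
target = torch.zeros(4)

# reduction="mean" averages the per-sample losses over the batch.
loss_mean = F.mse_loss(x * w, target, reduction="mean")
(grad_mean,) = torch.autograd.grad(loss_mean, w)

# reduction="sum" adds the per-sample losses instead.
loss_sum = F.mse_loss(x * w, target, reduction="sum")
(grad_sum,) = torch.autograd.grad(loss_sum, w)

# The two gradients differ by a factor of the batch size.
print(grad_mean, grad_sum)
```

So whether you get an "average gradient" or a "summed gradient" is decided by the loss you build, not by autograd.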

You could, for example, have a minibatch of ten samples, but only
compute the loss for the first sample, ignoring the other nine. If you
did that, you would only get the gradient of that first sample.
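A small sketch of that "first sample only" case, again with a made-up one-weight model and a squared per-sample loss chosen just for illustration:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])      # hypothetical batch of three samples

per_sample_loss = (x * w) ** 2          # one loss value per sample
loss_first_only = per_sample_loss[0]    # ignore every sample except the first
loss_first_only.backward()

# Only x[0] contributes: d/dw (x0 * w)^2 = 2 * x0^2 * w
print(w.grad)
```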

And, just to be clear, the optimizer does not calculate the gradients,
averaged or otherwise. By the time you call `opt.step()` the gradients
have already been calculated (when you called `loss.backward()`).
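To make the division of labor concrete, here is a minimal sketch using plain SGD. The learning rate and the variable names `grad_before_backward` and `grad_after_backward` are just choices for this example.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
x = torch.tensor([1.0, 2.0])
opt = torch.optim.SGD([w], lr=0.1)

loss = (x * w).mean()

grad_before_backward = w.grad           # None: no gradient has been computed yet
loss.backward()                         # autograd fills in w.grad here
grad_after_backward = w.grad.clone()

opt.step()                              # plain SGD: w <- w - lr * w.grad
print(grad_before_backward, grad_after_backward, w)
```

The optimizer only reads `w.grad`; it never looks at the loss or the batch.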

Best.

K. Frank

Thanks for replying. OK, so assume that my loss is the average over the individual losses in the minibatch. How does `.backward()` turn that into an average gradient? If my network were simply the operation “1/x”, with 10 different values of x in my minibatch, then presumably the gradient would have 10 different values, even though I was backpropagating an average loss?

Hi Miturian!

Pytorch’s autograd operates on tensor computations that produce
a scalar. (Autograd can manage things slightly more general than
just a scalar result, but let’s leave that aside for this discussion.)

Leaf tensors are those that are inputs to the overall computation,
and can have their `requires_grad` property set to `True` or `False`.
Autograd computes the gradient of the scalar result with respect
to (the elements of) the `requires_grad = True` leaf tensors.

Autograd neither knows nor cares that the tensor computation is
implemented as a pytorch `Module` that we conceptually understand
as a model, nor that one of its `requires_grad = False` leaf tensors
is, in the typical use case, the input to your model and represents a
batch of samples. It neither knows nor cares that the `weight`s (and
other `Parameter`s) of your model are its `requires_grad = True`
leaf tensors with respect to which it will calculate gradients.

This structure of applying a model with weights to a batch of samples
and then computing a scalar loss function for the output of the model
is higher-level structure that is distinct from autograd’s lower-level
functionality of computing gradients of tensor operations that may
or may not fit that higher-level structure.

(Of course, the gradients that autograd computes
will typically be used by an optimizer to train the weights of a model.)

Having gotten all of that out of the way …

At a mathematical level, the average of the gradients is the gradient
of the average. That is, taking the gradient is a linear operation.
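A quick numerical check of that linearity statement, as a sketch only. The `torch.sin` per-sample loss and the helper name `per_sample_loss` are arbitrary choices for the example, picked just to have something nonlinear.

```python
import torch

w = torch.tensor(0.5, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0, 4.0])

def per_sample_loss(xi, w):
    # Hypothetical nonlinear toy loss; works on a single sample or a whole batch.
    return torch.sin(xi * w)

# Average of the gradients: one separate backward pass per sample.
per_sample_grads = torch.stack(
    [torch.autograd.grad(per_sample_loss(xi, w), w)[0] for xi in x]
)
average_of_grads = per_sample_grads.mean()

# Gradient of the average: a single backward pass on the averaged loss.
(grad_of_average,) = torch.autograd.grad(per_sample_loss(x, w).mean(), w)

print(average_of_grads, grad_of_average)
```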

But the more computationally mechanistic answer is that the average
is computed by summing over a number of items and then multiplying
that sum by one over the number of items. Autograd tracks the
“computation graph” of these operations and applies the chain rule to
them numerically. The gradient (in this case, the simple derivative) of
constant times x (with respect to x) is just the constant. Upstream of
that in the computation graph is that the gradient of sum of x_i (with
respect to the x_i) is just one for each of the x_i.
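The two chain-rule steps described above can be sketched by hand and compared with what autograd returns. The values in `x` and the names `dm_ds`, `ds_dx`, and `manual_grad` are made up for this illustration.

```python
import torch

x = torch.tensor([3.0, 5.0, 7.0, 9.0], requires_grad=True)
n = x.numel()

# Forward: the average written out as "sum, then multiply by 1 / n".
s = x.sum()
m = s * (1.0 / n)
m.backward()

# Backward by hand, following the chain rule from the result back to x.
dm_ds = 1.0 / n              # derivative of (constant * s) with respect to s
ds_dx = torch.ones(n)        # derivative of sum(x_i) with respect to each x_i
manual_grad = dm_ds * ds_dx

print(x.grad, manual_grad)
```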

This example doesn’t really have the structure of a model with
weights applied to a batch of input samples. (In your case, it sounds
like you want the gradient with respect to the “input minibatch.” That’s
fine, but it’s not the typical use case.)

So let me phrase it a bit differently:

I have a 1d tensor, `T`, of shape `[10]` that consists of ten different values
of `x` (that I will consider a minibatch). I have a tensor operation that
computes the (scalar) average of `1 / x` for that set of ten `x`s. We won’t
call this a “model” because it doesn’t really fit the typical model use case.
This scalar-valued tensor operation has a single leaf tensor, `T`, and it
has `requires_grad = True`. Let’s use autograd to compute the gradient
of the scalar result with respect to `T`:

```
>>> import torch
>>> torch.__version__
'1.9.0'
>>> T = torch.arange (10.) + 1.
>>> T.requires_grad = True   # make T a "requires_grad = True" leaf tensor
>>> T
tensor([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.], requires_grad=True)
>>> scalar_result = (1 / T).mean()
>>> scalar_result
tensor(0.2929, grad_fn=<MeanBackward0>)
>>> scalar_result.backward()
>>> T.grad                   # -1 / (10 * T**2), one entry per element of T
tensor([-0.1000, -0.0250, -0.0111, -0.0063, -0.0040, -0.0028, -0.0020, -0.0016,
        -0.0012, -0.0010])
```

The “magic” happens because the pytorch developers have implemented
a companion `.backward()` function for each (differentiable) pytorch
tensor operation. Autograd keeps track of how
a series of such tensor operations are composed together in its
“computation graph” and uses the chain rule to link together the
individual `.backward()` functions to compute the gradient of the
final scalar result with respect to the `requires_grad = True` leaf
tensors.
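To give a feel for what such a companion function looks like, here is a sketch of a hand-written one for `1 / x`, built on `torch.autograd.Function`. The class name `Reciprocal` is made up for this example, and this is not how pytorch implements its built-in reciprocal internally; it only shows the forward / backward pairing.

```python
import torch

class Reciprocal(torch.autograd.Function):
    """Hypothetical hand-written op: y = 1 / x with its own backward."""

    @staticmethod
    def forward(ctx, x):
        y = 1.0 / x
        ctx.save_for_backward(y)        # context the backward pass will need
        return y

    @staticmethod
    def backward(ctx, grad_output):
        (y,) = ctx.saved_tensors
        # Chain rule: upstream gradient times dy/dx = -1 / x**2 = -y**2
        return grad_output * (-y * y)

T = (torch.arange(10.0) + 1.0).requires_grad_()

Reciprocal.apply(T).mean().backward()
custom_grad = T.grad.clone()

T.grad = None                           # reset before the second backward pass
(1.0 / T).mean().backward()
builtin_grad = T.grad.clone()

print(torch.allclose(custom_grad, builtin_grad))
```

Note that `backward()` receives the whole upstream gradient tensor at once and returns a whole tensor; nothing in it is specific to a "batch."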

Best.

K. Frank


Dear KFrank

Thank you very much for your detailed answers. I think I am getting close to understanding it, but will ask one more question, to which the answer is hopefully ‘yes, that’s it’. If so, I’ll update this thread with a description of my own misunderstanding, so perhaps others making the same mistake can get some extra value from this thread.

Anyway:

I agree that my ‘1/x’ example was poorly chosen. I was simply grasping for a function with a non-constant gradient.

Let’s assume a very simple network:

y=x*w

with the loss being avr(y)

Let’s assume w=2, x=[1,2] (we have a batchSize of 2, input is a scalar)

In the forward pass, we get y=[2,4], L=.5*(2+4)=3

By the chain rule:

`dL/dw=dL/dy0*dy0/dw+dL/dy1*dy1/dw`

We see that there is a “batchSize” number of terms in this sum. As I understand you, all these factors are stored in the ‘graph’ until .backward() is called, is that correct? I was assuming that there would be averaging or something else going on as each layer was called in the network, but that was incorrect? Since autograd doesn’t know anything about mini batches, does that mean that x=1 and x=2 create two almost equal ‘arms’ in the graph, with a common root at L? If that picture makes sense.

So, the graph contains ALL computations that took place, even if a lot of them are linked to the same variables.

As in, I was expecting the graph to somehow make a rewriting like:

`dL/dy0*dy0/dw+dL/dy1*dy1/dw=dL/dy0*(dy0/dw+dy1/dw)`

to take advantage of symmetries, but from your ‘neither knows nor cares’ phrasing, I assume that autograd does no such thing, and instead keeps track of everything?

Is there a way to inspect this graph somehow? Or at least see its size? I am supposed to be including some of this in a lecture, and I think some way of visualizing how much is actually being saved in the graph would be instructive.

Regards

Hi Miturian!

I think you’re trying to say something slightly different, so let me
rephrase it:

`x` is a batchSize-2 “input” to the “model”; `w` is the (scalar) weight
that defines this very simple “model”.

Yes.

Yes, but I think it’s more helpful to think of this as a “tensor” chain
rule, rather than a chain rule broken down into scalar pieces.

Thus:

`dL / dw = sum_i { dL / dy_i * dy_i / dw }`,

where `sum_i {... * ...}` is a tensor contraction.

That is:

```
>>> import torch
>>> torch.__version__
'1.9.0'
>>>
>>> w = torch.tensor ([2.], requires_grad = True)  # "requires_grad = True" leaf tensor
>>> x = torch.tensor ([1., 2.])                    # "requires_grad = False" leaf tensor
>>> y = torch.mul (x, w)                           # non-leaf tensor (same as x * w)
>>> L = y.mean()                                   # root (non-leaf) final result
>>>
>>> y.retain_grad()                                # non-leaf tensors only keep .grad if asked
>>> L.retain_grad()
>>>
>>> w
tensor([2.], requires_grad=True)
>>> x
tensor([1., 2.])
>>> y
tensor([2., 4.], grad_fn=<MulBackward0>)
>>> L
tensor(3., grad_fn=<MeanBackward0>)
>>>
>>> L.backward (retain_graph = True)
>>>
>>> L.grad   # unit "seed" to start chain rule
tensor(1.)
>>> y.grad   # [dL / dy_0, dL / dy_1]
tensor([0.5000, 0.5000])
>>> x.grad   # None, because x is a leaf with requires_grad = False
>>> w.grad   # [dL / dw] = sum_i {dL / dy_i * dy_i / dw}
tensor([1.5000])
```

If I understand you correctly that “these factors” are things like `dy_i / dw`,
then, no, they are not stored in the graph, waiting for `.backward()` to
be called.

Instead, when `y = x * w` is called (in the “forward pass”), only the fact
that `*` (`torch.mul()`) was called is stored – by using a `MulBackward0`
object – together with any `torch.mul()`-specific context that `mul()`'s
companion `backward()` function will need to calculate `dy_i / dw` at
the time that the backward pass is run.
To emphasize, `dy_i / dw` is
neither computed nor stored during the forward pass (but information
sufficient to compute it during the backward pass is stored).
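You can see the recorded operation on the result tensor itself. A minimal sketch, where `node_name` is just a variable name chosen for the example:

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
x = torch.tensor([1.0, 2.0])

y = x * w                                # forward pass only

# The result remembers which operation produced it, not any derivative values.
node_name = type(y.grad_fn).__name__
print(node_name)

# No gradient exists yet, because no backward pass has been run.
print(w.grad)
```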

I prefer to say that `y = x * w` creates a single arm, but that the arm in
question is a “tensor arm.” This is not purely semantic, in that autograd
stores a single “tensor arm” for this operation, rather than two separate
“scalar arms.”

Just to be clear, the graph only contains those computations that
lead back to a `requires_grad = True` leaf tensor. It doesn’t
waste time or storage on computations that lead back solely to
`requires_grad = False` leaf tensors.
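A short sketch of that difference. The names `no_graph` and `with_graph` are made up for the example.

```python
import torch

x = torch.tensor([1.0, 2.0])                    # requires_grad = False leaf
w = torch.tensor([2.0], requires_grad=True)     # requires_grad = True leaf

no_graph = (x * 3.0).sum()      # depends only on x: nothing is recorded
with_graph = (x * w).sum()      # depends on w: backward nodes are recorded

print(no_graph.requires_grad, no_graph.grad_fn)
print(with_graph.requires_grad, with_graph.grad_fn)
```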

If by this you mean that a lot of scalar computations are “linked to
the same variables” because those scalar computations are part
of the same tensor computation, the graph, in a sense, contains
all of those scalar computations, but in an efficient way, because
those scalar computations are packaged together as a single tensor
operation.

If by “symmetries” you mean that the several (in this example, two)
scalar operations are related to one another because they are part
of the same tensor operation, then, yes, autograd takes advantage
of this structure, and does so, as described above, by storing these
several scalar operations together as a single tensor operation.

I don’t know how to, but I think I’ve seen posts that talk about probing
the details of the graph. (I would have to imagine that the pytorch api
supports this, even if it isn’t well-documented or a part of what we
think of as the “public-facing” api, but, again, I don’t know how to do it.)
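One possible way to poke at the graph, offered as a sketch rather than a recommended tool: every non-leaf result has a `grad_fn`, and each such node exposes `next_functions`, which can be followed back toward the leaves. The helper `graph_node_names` below is a made-up name for this example, and it only lists node types, not the memory they hold.

```python
import torch

def graph_node_names(root):
    """Collect the type names of all backward nodes reachable from root.grad_fn."""
    names, stack, seen = [], [root.grad_fn], set()
    while stack:
        fn = stack.pop()
        if fn is None or fn in seen:    # None marks an input that needs no gradient
            continue
        seen.add(fn)
        names.append(type(fn).__name__)
        stack.extend(next_fn for next_fn, _ in fn.next_functions)
    return names

w = torch.tensor([2.0], requires_grad=True)

# Same "model", batch size 2 versus batch size 1000.
small = graph_node_names((torch.tensor([1.0, 2.0]) * w).mean())
large = graph_node_names((torch.rand(1000) * w).mean())

print(small)
print(large)
```

The list of nodes does not grow with the batch size, which fits the “tensor arm” picture: a larger batch means larger tensors saved by the same few nodes, not more nodes.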

Best.

K. Frank


Thank you, you have done a heroic job of answering
