How does PyTorch's backward function handle batched inputs?

Hi, I am curious as to how PyTorch's backward function handles batched inputs.

Specifically, are the losses averaged across inputs in the final layer itself (cross-entropy loss, etc.), or are the input-specific loss matrices passed on to previous layers, with each layer averaging/summing across these losses?

For example, for an Input Layer (p neurons) -> Linear (q neurons) -> CrossEntropy NN, and for a batched input of size “b”, does the backward differential that reaches the Linear layer have a shape of “q” or “b x q”?

The reason for the question is that the latter is simpler to implement for the linear layer (dW is a simple outer product).
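
To make the question concrete, here is a rough sketch of what I mean (shapes only; the names and sizes are just for illustration):

```python
import torch

# hypothetical sizes, just to illustrate the shapes in the question
b, p, q = 4, 3, 5
x = torch.randn(b, p)          # batched input to the Linear layer
grad_out = torch.randn(b, q)   # upstream differential, if it keeps the batch dimension ("b x q")

# in the "b x q" case, dW is just a sum of per-sample outer products
dW = torch.einsum('bq,bp->qp', grad_out, x)   # shape [q, p]
print(dW.shape)
```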

Thanks.

Hi Siddhanth!

Yes, in the simplest case the loss is averaged over the batch dimension.

In a typical network architecture, the leading dimension of the input tensor is a
batch dimension and this batch dimension is passed through each layer of the
network so that the final output of the network has the same leading batch dimension.
(It doesn’t have to be this way, but this is typical.)

The loss function then averages loss values over the output’s batch dimension (as
well as possibly averaging over other things, such as pixels in an output image). This
is done by default by many built-in pytorch loss functions, which are given the default
setting of reduction = 'mean' when they are instantiated.
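
For example (a minimal sketch with made-up sizes), you can see the averaging by comparing CrossEntropyLoss with its default reduction = 'mean' against reduction = 'none':

```python
import torch
import torch.nn as nn

b, q = 4, 5
logits = torch.randn(b, q)
target = torch.randint(0, q, (b,))

loss_mean = nn.CrossEntropyLoss()(logits, target)                  # default reduction = 'mean', a scalar
loss_none = nn.CrossEntropyLoss(reduction='none')(logits, target)  # per-sample losses, shape [b]

print(loss_none.shape)                               # torch.Size([4])
print(torch.allclose(loss_mean, loss_none.mean()))   # True
```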

Not exactly (this is in reply to whether each layer receives input-specific losses and
aggregates them). Autograd passes the gradient of a single scalar loss function with
respect to the output of some given intermediate layer back to that layer. Because
the output of that intermediate layer (typically) has a batch dimension, that gradient
will have a batch dimension.

Let’s say that the intermediate layer you are calling “Linear(q neurons)” is
Linear (in_features = p, out_features = q) and its input is a tensor of
shape [b, p] (where b is the batch size). Then its output will have shape [b, q].
When backpropagation reaches that layer, it will pass a gradient (of the scalar loss
value) of shape [b, q] to that layer. Autograd then computes the jacobian-vector
product of the jacobian of the transformation performed by the layer with the “vector”
that is the gradient that has been passed to the layer (as computed by the previous
backpropagation steps). This resulting jacobian-vector product now is the gradient
of the final loss value with respect to the input to this layer (and has shape [b, p]).
(Autograd also computes the gradient of the final loss value with respect to the
Parameters of the layer and accumulates them into those Parameters’ .grad
properties.)
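
Here is a minimal sketch (with made-up sizes) that shows those shapes by attaching a full backward hook to the Linear layer:

```python
import torch
import torch.nn as nn

b, p, q = 4, 3, 5                      # batch size, in_features, out_features
lin = nn.Linear(p, q)

def show_grad_shapes(module, grad_input, grad_output):
    print('grad w.r.t. output:', grad_output[0].shape)   # torch.Size([4, 5]) -- [b, q]
    print('grad w.r.t. input: ', grad_input[0].shape)    # torch.Size([4, 3]) -- [b, p]

lin.register_full_backward_hook(show_grad_shapes)

x = torch.randn(b, p, requires_grad=True)
target = torch.randint(0, q, (b,))
loss = torch.nn.functional.cross_entropy(lin(x), target)  # single scalar loss (mean over the batch)
loss.backward()

print(lin.weight.grad.shape)           # torch.Size([5, 3]) -- accumulated into .grad
```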

Best.

K. Frank

Hi Frank,
Thank you for your reply.

Now that I think about it, doesn’t it have to include a batch dimension, in both the forward and backward passes?

It doesn’t have to be this way, but this is typical.

In the forward direction we need to pass the batch dimension to each layer, because if we perform some aggregation we lose information.
In the backward direction the same requirement holds: if we consider a single neuron, we need to take a vector-vector dot product (a weighted average), and any aggregation would lead to incorrect gradient computations.

Autograd then computes the jacobian-vector product

Shouldn’t this be a jacobian-matrix product, as otherwise the output would be a vector?
Moreover, just to confirm, the jacobian is the upstream gradient, right?

Thank you so much for your prompt and detailed response.
I appreciate it :).

Hi Siddhanth!

Logically, it is not necessary to have a batch dimension. There are, however, a
number of built-in pytorch layers that do require a batch dimension (but such batch
dimensions can always have size 1). For example, Linear does not require a
batch dimension, but Conv2d does.
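
A quick illustration (layer sizes are arbitrary, and unbatched-input support for some layers can depend on your pytorch version):

```python
import torch
import torch.nn as nn

lin = nn.Linear(3, 5)
print(lin(torch.randn(3)).shape)      # no batch dimension -> torch.Size([5])
print(lin(torch.randn(7, 3)).shape)   # with a batch dimension -> torch.Size([7, 5])

conv = nn.Conv2d(3, 8, kernel_size=3)
x = torch.randn(1, 3, 16, 16)         # batch dimension of size 1
print(conv(x).shape)                  # torch.Size([1, 8, 14, 14])
```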

I’m not quite sure what you mean by this, but to clarify: Let’s say we had a batch
of five samples. If we wanted to get rid of the batch dimension, we wouldn’t do it
by somehow aggregating the five samples together – instead, we would take a
single sample from the batch and run it through the network, doing this five times.
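
For instance (a small, made-up sketch), running the five samples one at a time gives the same result as running them as a single batch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin = nn.Linear(3, 5)
batch = torch.randn(5, 3)                           # a batch of five samples

batched_out = lin(batch)                            # one pass with a batch dimension
one_by_one = torch.stack([lin(s) for s in batch])   # five separate unbatched passes

print(torch.allclose(batched_out, one_by_one))      # True
```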

Using standard terminology, if you have a scalar-valued function of a vector input,
f (x_j), the gradient is the vector made up of the partial derivatives, d f / d x_j.
If you have a vector-valued function of a vector input, f_i (x_j), the jacobian is
the matrix of partial derivatives, d f_i / d x_j. It’s just terminology, but in this sense
the jacobian is essentially a gradient, but for a vector-valued function.
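
As a concrete (made-up) example, torch.autograd.functional.jacobian illustrates both cases:

```python
import torch

def f(x):                      # vector-valued function of a vector input
    return torch.stack([x[0] * x[1], x[0] + x[2], x[1] ** 2, x[2] ** 3])

def g(x):                      # scalar-valued function of a vector input
    return (x ** 2).sum()

x = torch.randn(3)

J = torch.autograd.functional.jacobian(f, x)      # matrix of d f_i / d x_j
print(J.shape)                                    # torch.Size([4, 3])

grad = torch.autograd.functional.jacobian(g, x)   # for a scalar function this is just the gradient
print(grad.shape)                                 # torch.Size([3])
```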

A layer in a network is a vector-valued function. For example, a Linear (3, 5) maps
a vector of length 3 to a vector of length 5. When you backpropagate through a
Linear its jacobian is never explicitly computed (let alone materialized). Instead,
autograd implicitly computes the jacobian-vector product of the jacobian for that
Linear layer and the vector that is the gradient of the final scalar loss value with
respect to the output of that Linear, as computed by the previous steps in the
backpropagation. This jacobian-vector product is the gradient of the final scalar loss
value with respect to the input to that Linear (which is then passed back on up the
backpropagation chain).
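
Here is a minimal sketch (with made-up sizes) of that implicit product, using torch.autograd.grad and passing the upstream gradient in as grad_outputs:

```python
import torch
import torch.nn as nn

lin = nn.Linear(3, 5)
x = torch.randn(2, 3, requires_grad=True)
y = lin(x)                                    # shape [2, 5]

# stand-in for the gradient of the final scalar loss with respect to y,
# as it would arrive from the later steps of backpropagation
grad_y = torch.randn(2, 5)

# autograd forms the product of lin's jacobian with grad_y without ever
# materializing the jacobian itself
grad_x, = torch.autograd.grad(y, x, grad_outputs=grad_y)
print(grad_x.shape)                           # torch.Size([2, 3])
```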

Best.

K. Frank

Hi Frank,
Thank you for your reply.
I believe most of my confusion was due to my own lack of understanding.

Thank you for giving such a detailed explanation!
It was very clear.

Best regards,
Siddhanth Ramani