Understanding how loss and backward work correctly for multiple leaves

I have a simple question that I fail to understand: how does the loss, which is usually a single value, propagate correctly in a network with multiple output nodes?

I’ll give an example. Assume I have a network with X inputs and 2 outputs. The network gets pictures of cats, dogs, or something else. It should output a high value at output node 1 for a dog (a low value if there is no dog), and a high value at output node 2 for a cat (a low value if there is no cat).
Assuming I use MSE for my loss function, or some other loss function (CrossEntropyLoss) which outputs a single value, how does the loss propagate correctly for each node?

For example, let's say I gave the network a batch with 1 picture of a cat, and I got a vector of 2 values (2 output nodes), which was (1,1). Therefore, for node 2 I do not need to propagate any error, but for node 1 I do need to propagate error. But when I do MSE I get a single loss value; how does this single loss value propagate correctly to each node?
My intuition would be that I’ll have a loss which is a vector (a loss for node 1 and a loss for node 2) and that each loss will propagate to its own node. But as I said, MSE gives back a single loss value.


Hi @orena1,

First, in autograd terms, leaf nodes are the inputs of the forward pass (the last nodes that backpropagation reaches), and the root is the output of the forward pass.

Remember that when you use MSELoss, the last operation to be applied is a reduction (by default the mean, but you can also choose the sum with the reduction argument).
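To make the reduction step concrete, here is a small sketch using the numbers from the question. The output (1, 1) and the target (0, 1) for a cat picture are assumptions taken from that example, and the variable names are only illustrative. With reduction="none" you get the per-node loss vector the question was expecting, and the mean or sum is simply applied on top of it.

```python
import torch
import torch.nn as nn

# Assumed values from the question: the network outputs (1, 1) for a cat picture.
y = torch.tensor([1.0, 1.0])
# Assumed target for a cat: node 1 (dog) should be low, node 2 (cat) should be high.
t = torch.tensor([0.0, 1.0])

# No reduction: one squared error per output node.
per_node = nn.MSELoss(reduction="none")(y, t)
# Default reduction: mean of the per-node errors.
mean_loss = nn.MSELoss(reduction="mean")(y, t)
# Alternative reduction: sum of the per-node errors.
sum_loss = nn.MSELoss(reduction="sum")(y, t)

print(per_node)   # per-node squared errors
print(mean_loss)  # single scalar
print(sum_loss)   # single scalar
```

So the single loss value is not a replacement for the per-node errors, it is computed from them.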

So the first backward operation to be computed will be the mean, as you can see here:

>>> loss(y, t)
tensor(0.1328, grad_fn=<MeanBackward0>) # <==

The mean is an operation which will propagate the gradient to every output node contributing to it, so each node receives a gradient proportional to its own error.
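As a rough illustration, here is a sketch with the numbers from the question (output (1, 1), target (0, 1); both are assumptions taken from that example). The MSE is written out by hand as a squared difference followed by a mean, so the reduction is explicit. After backward, the gradient on each output node is 2 * (y_i - t_i) / N, so the node that was already correct gets a zero gradient.

```python
import torch

# Assumed values from the question; requires_grad lets us inspect the
# gradient that arrives at each output node.
y = torch.tensor([1.0, 1.0], requires_grad=True)
t = torch.tensor([0.0, 1.0])

# MSE written out explicitly: per-node squared error, then the mean reduction.
loss = ((y - t) ** 2).mean()

# Backpropagate from the single scalar loss.
loss.backward()

print(loss.grad_fn)  # the backward node of the final mean operation
print(y.grad)        # one gradient entry per output node
```

Here node 1 receives a nonzero gradient and node 2 receives zero, even though backward started from a single scalar.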

I hope everything is clear. :slight_smile:
