Mean Reduction in nll_loss

According to the nll_loss documentation for the reduction parameter: "'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed."

However, it seems that with “mean” the sum is divided by the sum of the weights of the target classes, not by the number of elements in the output. Am I misunderstanding the documentation, or is the documentation inaccurate?

import torch
import torch.nn.functional as F

input = torch.randn(3, 5, requires_grad=True)
ls = F.log_softmax(input, 1)
ls
tensor([[-2.5076, -2.2387, -0.4952, -2.2435, -2.3390],
        [-0.9897, -1.8469, -1.6642, -1.4942, -2.8676],
        [-0.7173, -3.8274, -2.2838, -1.0784, -3.0335]],
       grad_fn=<LogSoftmaxBackward>)

# per-sample weighted losses (the third sample's target, class 4, has weight 3.0)
F.nll_loss(ls, torch.tensor([0, 0, 4]), reduction="none", weight=torch.tensor([1.0, 1.0, 1.0, 1.0, 3.0]))

tensor([2.5076, 0.9897, 9.1005], grad_fn=<NllLossBackward>)

F.nll_loss(ls, torch.tensor([0, 0, 4]), reduction="sum", weight=torch.tensor([1.0, 1.0, 1.0, 1.0, 3.0]))

tensor(12.5978, grad_fn=<NllLossBackward>)

F.nll_loss(ls, torch.tensor([0, 0, 4]), reduction="mean", weight=torch.tensor([1.0, 1.0, 1.0, 1.0, 3.0]))

tensor(2.5196, grad_fn=<NllLossBackward>)
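
For reference, here is a small sketch (not from the original post; the log-probabilities are hard-coded from the printout above) showing which divisor reproduces the reported “mean”:

import torch
import torch.nn.functional as F

# log-probabilities copied from the example above
ls = torch.tensor([[-2.5076, -2.2387, -0.4952, -2.2435, -2.3390],
                   [-0.9897, -1.8469, -1.6642, -1.4942, -2.8676],
                   [-0.7173, -3.8274, -2.2838, -1.0784, -3.0335]])
target = torch.tensor([0, 0, 4])
weight = torch.tensor([1.0, 1.0, 1.0, 1.0, 3.0])

per_sample = F.nll_loss(ls, target, reduction="none", weight=weight)
print(per_sample)                                # tensor([2.5076, 0.9897, 9.1005])
print(per_sample.sum() / weight[target].sum())   # 2.5196 -- divided by the sum of the target weights (5.0)
print(per_sample.sum() / len(target))            # 4.1993 -- divided by the number of samples (3)
print(F.nll_loss(ls, target, reduction="mean", weight=weight))  # 2.5196, matches the first divisor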

Hello Johnson!

You are correct that in the weighted case, “mean” is calculated by
dividing by the sum of the weights. (In other words, a weighted
average is being calculated.)

The documentation on this is arguably inaccurate, or at least rather
unclear.

Anyway, the way I think of it, if I have ten samples, all with the same
loss, say, 1.23, I would like my “mean” loss, weighted or not, to be
this same loss value, 1.23. So dividing by the sum of the weights is,
for me, the “expected behavior,” even if the documentation says
otherwise.

(This issue has come up before on the forum, although I don’t have
any references at hand.)
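
(A quick sketch of that scenario, with made-up log-probabilities and class weights that are not from this thread: every unweighted per-sample loss is 1.23, and the weighted “mean” stays 1.23 regardless of the weights, whereas dividing by the number of samples would not.)

import torch
import torch.nn.functional as F

# every log-probability is -1.23, so every unweighted per-sample loss is 1.23
ls = torch.full((10, 3), -1.23)
target = torch.tensor([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
weight = torch.tensor([1.0, 2.0, 5.0])   # arbitrary, illustrative class weights

print(F.nll_loss(ls, target, reduction="mean", weight=weight))               # 1.2300
print(F.nll_loss(ls, target, reduction="sum", weight=weight) / len(target))  # 3.0750, not 1.23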

Best.

K. Frank

Anyway, the way I think of it, if I have ten samples, all with the same
loss, say, 1.23, I would like my “mean” loss, weighted or not, to be
this same loss value, 1.23. So dividing by the sum of the weights is,
for me, the “expected behavior,” even if the documentation says
otherwise.

Hello Frank, I think the example you gave actually matches the expected behavior as described in the documentation. The sum should be divided by the number of samples, not by the sum of the weights of the target classes. That is what I expected from reading the documentation.

Hello Johnson!

No, I think you have my example backwards.

Let’s say that you have three classes, 0, 1, 2. Let’s say you also
have three samples, one from each class, in order. Now suppose
that the (unweighted) per-sample loss happens to be the same for
all three samples, 1.23.

If you pass in a weight vector (class weights) of [1.0, 1.0, 3.0],
then after weighting, but before taking the “mean,” you will have a
summed, weighted loss of
1.0 * 1.23 + 1.0 * 1.23 + 3.0 * 1.23 = 5.0 * 1.23 = 6.15

If you divide by the number of samples, 3, you will get 2.05. If you
divide by the sum of the weights, 5.0, you will get 1.23, the loss
value, that in this example, happens to be the same for each of the
three samples.
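
A tiny check of those numbers (the log-probabilities below are constructed just so that each unweighted per-sample loss is 1.23; they are not from the thread):

import torch
import torch.nn.functional as F

ls = torch.full((3, 3), -1.23)           # every unweighted per-sample loss is 1.23
target = torch.tensor([0, 1, 2])         # one sample from each class, in order
weight = torch.tensor([1.0, 1.0, 3.0])

print(F.nll_loss(ls, target, reduction="sum", weight=weight))   # 6.1500
print(F.nll_loss(ls, target, reduction="mean", weight=weight))  # 1.2300  (6.15 / 5.0)
print(6.15 / 3)                                                 # 2.05    (dividing by the number of samples)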

If I “average” together a bunch of numbers that all happen to have
the same value, to me it makes sense that the average should be
that value. That’s what I meant by saying that this is my “expected
behavior” (regardless of the documentation, which I do think is
inaccurate).

Best.

K. Frank

Thanks for the explanation, K. Frank.