From my understanding, the weight parameter in CrossEntropyLoss behaves differently for the mean reduction than for the other reductions. I believe that for the non-mean reductions, each sample's loss is simply scaled by the class weight for that sample. In the case of the mean reduction, however, the loss is first scaled per sample, and then the sum is normalized by the sum of the weights within the batch.

The important thing to me is the last part, namely that the loss is normalized by the sum of weights within the batch. Is this observation correct? I also wonder what the idea behind it is: if I am using weights to, e.g., balance classes across the dataset, wouldn’t I want to normalize globally?
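Here is a minimal snippet showing what I mean (random logits and an arbitrary weight vector, chosen just for illustration):

```python
import torch

torch.manual_seed(0)
logits = torch.randn(4, 3)              # 4 samples, 3 classes
targets = torch.tensor([0, 1, 2, 1])
weight = torch.tensor([1.0, 2.0, 3.0])  # arbitrary class weights

loss_none = torch.nn.functional.cross_entropy(
    logits, targets, weight=weight, reduction='none')
loss_mean = torch.nn.functional.cross_entropy(
    logits, targets, weight=weight, reduction='mean')

# reduction='none' gives per-sample losses already scaled by each
# sample's class weight; 'mean' normalizes their sum by the sum of
# the per-sample weights, not by the number of samples.
sample_weights = weight[targets]        # [1., 2., 3., 2.]
manual_mean = loss_none.sum() / sample_weights.sum()
print(torch.allclose(loss_mean, manual_mean))  # True
```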

This is correct (if I understand what you are saying).

The “mean reduction” computes a (conventional) weighted average,
that is, it does divide by the sum of the weights.

This makes sense to me because if, by happenstance, all of the
samples in the batch have the same loss, loss_all, I would like the
mean reduction over that batch also to give a batch mean of loss_all.
The conventional weighted average does this.
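For instance, in plain arithmetic (with a hypothetical shared loss of 0.7 and arbitrary weights):

```python
# A conventional weighted average of identical per-sample losses
# returns that shared loss value, whatever the weights are.
losses = [0.7, 0.7, 0.7]
weights = [1.0, 2.0, 5.0]
weighted_mean = sum(w * l for w, l in zip(weights, losses)) / sum(weights)
print(weighted_mean)  # 0.7 (up to floating-point rounding)
```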

These two threads (about NLLLoss, but it’s the same issue) give some
additional words of explanation:

Thanks for the elaboration. It still does not fully make sense to me. Consider the following example, where I have 4 samples split into two batches of 2 samples each.

Now suppose my first batch has weights [1,1], my second batch has weights [2,2], and the loss is 1 for all 4 samples. Then both batches have exactly the same overall (mean-reduced) loss. This does not seem intuitive to me, as I want to weigh the second batch higher.

First, note that in practice you can use reduction='sum', and then
sum the two per-batch losses together to get the relative weighting
you want. (You can divide by the total number of samples, if you choose.)
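A quick sketch of that route, using the (weight, loss) pairs from your example:

```python
# Two hypothetical batches: per-sample loss 1.0, weights [1,1] and [2,2].
batch1 = [(1.0, 1.0), (1.0, 1.0)]   # (weight, loss) pairs
batch2 = [(2.0, 1.0), (2.0, 1.0)]

# reduction='sum' style: sum the weighted per-sample losses per batch.
sum1 = sum(w * l for w, l in batch1)  # 2.0
sum2 = sum(w * l for w, l in batch2)  # 4.0 -- second batch counts for more

# Optionally normalize by the total number of samples across batches.
total = (sum1 + sum2) / (len(batch1) + len(batch2))
print(sum1, sum2, total)  # 2.0 4.0 1.5
```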

You do, however, raise a legitimate point: Splitting a batch in two, and
averaging together the weighted averages of the two sub-batches
won’t give the same result as taking the weighted average over the
unsplit batch. But it’s a trade-off, and you can’t have everything. I’m
willing to give this up in order to have the weighted average of a bunch
of identical losses be equal to that shared loss value.

(Note, if you take an appropriately weighted average of the sub-batch
weighted averages, you will get what you want, although, in practice,
going this route likely confuses the issue.)
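A small sketch with hypothetical weights and losses, just to illustrate the recombination:

```python
# Weighting each sub-batch mean by that sub-batch's total weight
# recovers the weighted average over the unsplit batch.
weights = [1.0, 1.0, 2.0, 2.0]
losses  = [1.0, 3.0, 2.0, 4.0]

# Weighted average over the unsplit batch.
full = sum(w * l for w, l in zip(weights, losses)) / sum(weights)

# Split into two sub-batches and take each sub-batch's weighted mean.
w1, l1 = weights[:2], losses[:2]
w2, l2 = weights[2:], losses[2:]
m1 = sum(w * l for w, l in zip(w1, l1)) / sum(w1)
m2 = sum(w * l for w, l in zip(w2, l2)) / sum(w2)

# Recombine, weighting each sub-batch mean by its total weight.
recombined = (sum(w1) * m1 + sum(w2) * m2) / (sum(w1) + sum(w2))
print(full, recombined)  # the two values agree
```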