The default loss function in multi-class classification is cross-entropy, which treats all wrong guesses equally. If the distance between buckets is meaningful (for example, given that the true bucket is 5, a guess of 6 is considered 3 times better than a guess of 9), is there a loss function that rewards closer guesses (without losing the weighting from probabilities as captured by cross-entropy)?

Bump. Can anyone help, please?

I haven’t used it, but I presume the `weight` argument of the cross-entropy loss covers this case: CrossEntropyLoss — PyTorch 2.0 documentation

I am not sure, though.

Hi Jerron!

There are a couple of ways you could go about this.

First, you could use `CrossEntropyLoss`’s *probabilistic* (“soft”) targets. Let’s say you have five classes and the right answer is `2`. Instead of using a “hard,” integer class label of `2`, you could use, for example, the set of probabilities `[0.0, 0.25, 0.5, 0.25, 0.0]`. So predicting classes `1` or `3` will not be penalized as much as predicting classes `0` or `4`.

Note, however, that predicting just class `2` (with high probability) will not be the best prediction. Instead, you will be training your model to predict a mix of classes `1`, `2`, and `3` (with probabilities `[0.25, 0.5, 0.25]`), which might not be what you want.
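A minimal sketch of this, with made-up logits for a single sample over five classes (`CrossEntropyLoss` accepts class-probability targets in PyTorch 1.10 and later):

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for one sample over five classes (made-up numbers).
logits = torch.tensor([[0.1, 1.2, 3.0, 1.1, 0.2]])

# "Soft" target: the true class is 2, with partial credit for neighbors 1 and 3.
soft_target = torch.tensor([[0.0, 0.25, 0.5, 0.25, 0.0]])

# cross_entropy() accepts class probabilities (same shape as logits) as targets.
soft_loss = F.cross_entropy(logits, soft_target)

# Logits concentrated on a near-miss class (3) are penalized less than
# logits concentrated on a far-off class (0).
near_miss = torch.tensor([[0.0, 0.0, 0.0, 5.0, 0.0]])
far_miss = torch.tensor([[5.0, 0.0, 0.0, 0.0, 0.0]])
```

Under the hood this is just `-(soft_target * F.log_softmax(logits, 1)).sum()` averaged over the batch, which is why nearby wrong guesses cost less.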

Another approach would be to use a predicted-probability-weighted `MSELoss` (or `L1Loss` or whatever). Convert the (unnormalized) log-probabilities predicted by your model into probabilities by passing them through `softmax()`. Then, letting `t` be the correct class (for example `t = 2`), you could use as your loss function:

```
p[0] * (t - 0)**2 + p[1] * (t - 1)**2 + p[2] * (t - 2)**2 + p[3] * (t - 3)**2 + p[4] * (t - 4)**2
```

For `t = 2`, your best prediction (with a loss of zero) will be to predict class `2` with probability one (`p[2] = 1.0`). Predicting, say, class `3` will have a higher loss, and `4` higher still. So you do penalize different incorrect predictions differently.
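A batched PyTorch sketch of this formula (the function name and shapes are my own; it computes the expected squared distance to the true class under the predicted distribution):

```python
import torch
import torch.nn.functional as F

def prob_weighted_mse(logits, target):
    """Expected squared class distance under the predicted distribution.

    logits: (batch, n_classes) raw scores; target: (batch,) integer labels.
    """
    probs = F.softmax(logits, dim=1)
    # Class indices 0..n_classes-1, broadcast against each sample's true class.
    classes = torch.arange(logits.size(1), dtype=torch.float32)
    sq_dist = (target.unsqueeze(1).float() - classes) ** 2
    # Weight each squared distance by its predicted probability, sum, then
    # average over the batch.
    return (probs * sq_dist).sum(dim=1).mean()
```

With `target = 2`, logits that put essentially all their weight on class `3` give a loss near `1`, while weight on class `4` gives a loss near `4`, matching the hand-written sum above.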

However, when your prediction is completely wrong, `CrossEntropyLoss` has a logarithmic divergence that I believe is very helpful for training. This `MSELoss`-like loss doesn’t have such a divergence, which could be a disadvantage.

You could consider adding such an `MSELoss`-like loss to the conventional hard-label `CrossEntropyLoss`. Now your loss will be at its minimum (of zero) when your prediction is completely correct, but will penalize worse incorrect predictions more than not-as-bad incorrect predictions.

But think carefully about your use case. If the distance between your buckets is meaningful and your buckets are ordered sequentially, then perhaps your problem is better modelled as regression (rather than classification) and you should use something like `MSELoss` without any bells and whistles.
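For instance, a minimal regression setup (the layer size and data here are made up for illustration) would predict the bucket directly as a single real number:

```python
import torch
import torch.nn as nn

# Hypothetical model: 10 input features, one real-valued "bucket" output.
model = nn.Linear(10, 1)
features = torch.randn(4, 10)                        # dummy batch
target_buckets = torch.tensor([5.0, 2.0, 7.0, 1.0])  # true buckets as floats

pred = model(features).squeeze(1)
loss = nn.MSELoss()(pred, target_buckets)
# Being off by 4 buckets now costs 16x as much as being off by 1.
loss.backward()
```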

In any event, you should only use a non-standard loss, such as those described above, if you can show that they work better on your problem than a standard pure classification with `CrossEntropyLoss` or a standard pure regression with something like `MSELoss` or `L1Loss`.

Good luck!

K. Frank

Thank you, Frank. Yes, I agree it could also be a regression rather than a classification. If we use regression, can we still somehow get the probabilities?

I’m a little surprised that there is no built-in function for it; is such a use case that rare? I will try your advice on a customized loss function, too, though I assume the performance would not be as good as the built-in `CrossEntropyLoss`.

Hi Jerron!

What, concretely, is this use case? Specifically, if you train a model for your use case with loss-function A, and then train a second model with loss-function B, how would you decide which model is better? What specific performance metrics would you use to choose between the two models?

The loss function you train with is, in some sense, a training-friendly proxy for the performance metrics that determine in a practical sense how well your model is working, so the performance metrics tell us what your use case is (and the loss function is a mere “implementation detail”).

Best.

K. Frank

I’m now using a loss function like the following:

```
def cross_entropy_mse_loss(input, target):
    return F.mse_loss(input.argmax(1), target) * a + F.cross_entropy(input, target)
```

And it seems to work. I’m not sure if it’s mathematically legit. Also, what would be the right coefficient `a`? For now I set `a = 1`.

Hi Jerron!

This won’t do what you want: the `mse_loss()` term won’t have any effect. This is because `input.argmax(1)` returns an integer and is therefore not (usefully) differentiable. From memory, autograd won’t backpropagate through `argmax()` (but if it did, it would backpropagate a zero gradient).
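A quick check (with made-up logits) shows that the graph is cut at `argmax()`:

```python
import torch

logits = torch.randn(4, 5, requires_grad=True)
idx = logits.argmax(dim=1)

# The result is an integer tensor detached from the autograd graph, so any
# loss computed from it cannot backpropagate into the logits.
print(idx.dtype)          # torch.int64
print(idx.requires_grad)  # False
print(idx.grad_fn)        # None
```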

To add a differentiable mse-like term to your combined loss function, consider using something like the predicted-probability-weighted `MSELoss` I suggested in my previous post.
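A sketch of such a combined loss (the function name and the `a` default are illustrative, not a PyTorch built-in), replacing the `argmax()` term with the expected squared class distance so that gradients flow through both terms:

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target, a=1.0):
    # Hard-label cross entropy keeps the helpful logarithmic divergence.
    ce = F.cross_entropy(logits, target)
    # Differentiable distance-aware term: squared distance from each class
    # index to the true class, weighted by the predicted probability.
    probs = F.softmax(logits, dim=1)
    classes = torch.arange(logits.size(1), dtype=torch.float32)
    sq_dist = (target.unsqueeze(1).float() - classes) ** 2
    mse_like = (probs * sq_dist).sum(dim=1).mean()
    return ce + a * mse_like
```

Unlike the `argmax()` version, gradients here reach the logits through both terms, so the value of `a` actually changes training.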

I don’t know of any a priori best value for `a`. You should treat it as a tunable hyperparameter (like a learning rate or weight-decay coefficient), perform multiple training runs with different values, and see which value works best. (But, as noted above, with your specific proposal, the value of `a` won’t have any effect on your training.)

Best.

K. Frank