The default loss function in multi-class classification is cross_entropy, which treats all wrong guesses equally. If the distance between buckets is meaningful – for example, given that the real bucket is 5, a guess of 6 should be penalized less than a guess of 9 – is there such a function that rewards the better guess (without losing the weights from probabilities as captured by cross_entropy)?
Bump. Can anyone help, please?
I haven’t used it, but I presume the weight argument of the cross entropy loss covers this case: CrossEntropyLoss — PyTorch 2.0 documentation
I am not sure, though.
There are a couple of ways you could go about this.
First, you could use CrossEntropyLoss’s probabilistic (“soft”) targets.
Let’s say you have five classes and the right answer is class 2. Instead of
using a “hard,” integer class label of 2, you could use, for example, the
set of probabilities [0.0, 0.25, 0.5, 0.25, 0.0]. So predicting class
3 will not be penalized as much as predicting class 0 or 4.
Note, however, that predicting just class 2 (with high probability) will not
be the best prediction. Instead, you will be training your model to predict
a mix of classes 1, 2, and 3 (with probabilities [0.25, 0.5, 0.25]), which
might not be what you want.
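A minimal sketch of the soft-target version (probabilistic targets in CrossEntropyLoss require a reasonably recent PyTorch – 1.10 or later, if I remember correctly):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

loss_fn = nn.CrossEntropyLoss()

# one sample, five classes
logits = torch.randn(1, 5, requires_grad=True)

# "soft" target: probability mass spread around the true class 2
soft_target = torch.tensor([[0.0, 0.25, 0.5, 0.25, 0.0]])

loss = loss_fn(logits, soft_target)
loss.backward()
```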
Another approach would be to use a predicted-probability-weighted
MSELoss (or L1Loss, or whatever). Convert the (unnormalized) log-probabilities
predicted by your model into probabilities by passing them through
softmax(). Then, letting p_i be the predicted probability for class i, and
t be the correct class (for example, t = 2), you could use as
your loss function:
p_0 * (t - 0)**2 + p_1 * (t - 1)**2 + p_2 * (t - 2)**2 + p_3 * (t - 3)**2 + p_4 * (t - 4)**2
When t = 2, your best prediction – with a loss of zero – will be to predict
class 2 with probability one (p_2 = 1.0). Predicting, say, class 3 will
have a higher loss, and class 4, higher still. So you do penalize different incorrect
predictions differently.
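For concreteness, here is one way that predicted-probability-weighted loss could be sketched (the function name is just for illustration):

```python
import torch

def prob_weighted_mse(logits, target):
    # logits: [batch, num_classes] raw scores; target: integer class labels, [batch]
    probs = torch.softmax(logits, dim=1)                       # p_i
    classes = torch.arange(logits.size(1), dtype=probs.dtype)  # 0, 1, ..., num_classes - 1
    # sum_i p_i * (t - i)**2, averaged over the batch
    sq_dist = (target.unsqueeze(1).to(probs.dtype) - classes) ** 2
    return (probs * sq_dist).sum(dim=1).mean()
```

Predicting the correct class with probability one gives a loss of zero, and the loss grows with the squared distance of the predicted class from the correct one.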
However, when your prediction is completely wrong, CrossEntropyLoss
has a logarithmic divergence that I believe is very helpful for training. This
MSELoss-like loss doesn’t have such a divergence, which could be a
drawback.
You could consider adding such a
MSELoss-like loss to the conventional
CrossEntropyLoss. Now your loss will be at its minimum (of
zero) when your prediction is completely correct, but will penalize worse
incorrect predictions more than not-as-bad incorrect predictions.
But think carefully about your use case. If the distance between your buckets
is meaningful and your buckets are ordered sequentially, then perhaps your
problem is better modelled as regression (rather than classification) and you
should use something like
MSELoss without any bells and whistles.
In any event, you should only use a non-standard loss, such as those
described above, if you can show that they work better on your problem
than a standard pure-classification with
CrossEntropyLoss or a standard
pure-regression with something like MSELoss.
Thank you, Frank. Yes, I agree it could also be a regression rather than a classification. If we use regression, can we still somehow get the probabilities?
I’m a little surprised that there is no built-in function for it – is such a use case that rare? I will try your advice on a customized loss function, too, though I assume the performance would not be as good as a built-in one.
What, concretely, is this use case? Specifically, if you train a model for your
use case with loss-function A, and then train a second model with loss-function
B, how would you decide which model is better? What specific performance
metrics would you use to choose between the two models?
The loss function you train with is, in some sense, a training-friendly proxy for
the performance metrics that determine in a practical sense how well your model
is working, so the performance metrics tell us what your use case is (and the
loss function is a mere “implementation detail”).
I’m now using a loss function like the following:

    def cross_entropy_mse_loss(input, target):
        return F.mse_loss(input.argmax(1), target) * a + F.cross_entropy(input, target)
And it seems to work. I’m not sure if it is mathematically legit. Also, what would be the right coefficient a? For now I set a = 1.
This won’t do what you want – the
mse_loss() term won’t have any effect.
This is because
input.argmax(1) returns an integer and is therefore not
(usefully) differentiable. From memory, autograd won’t backpropagate through
argmax() (but if it did, it would backpropagate a zero gradient).
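You can check this directly – argmax() returns an integer tensor that is detached from the graph:

```python
import torch

logits = torch.randn(4, 5, requires_grad=True)
idx = logits.argmax(1)

print(idx.dtype)          # torch.int64 -- an integer tensor
print(idx.requires_grad)  # False -- the graph is cut here
```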
To add a differentiable mse-like term to your combined loss function, consider
using something like the predicted-probability-weighted
MSELoss I suggested
in my previous post.
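Something along these lines (the function name and the toy batch are just illustrative):

```python
import torch
import torch.nn.functional as F

def cross_entropy_plus_weighted_mse(logits, target, a=1.0):
    # differentiable distance-aware term: sum_i p_i * (t - i)**2
    probs = torch.softmax(logits, dim=1)
    classes = torch.arange(logits.size(1), dtype=probs.dtype, device=logits.device)
    mse_term = (probs * (target.unsqueeze(1).to(probs.dtype) - classes) ** 2).sum(dim=1).mean()
    return a * mse_term + F.cross_entropy(logits, target)

logits = torch.randn(3, 5, requires_grad=True)
target = torch.tensor([0, 2, 4])
loss = cross_entropy_plus_weighted_mse(logits, target, a=1.0)
loss.backward()  # gradients flow through both terms
```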
I don’t know of any a priori best value for
a. You should treat it as a tunable
hyperparameter (like a learning rate or weight-decay coefficient) and perform
multiple training runs with different values and see which value works best.
(But, as noted above, with your specific proposal, the value of
a won’t have
any effect on your training.)