Custom loss function for multi-class classification

Hey everyone,
I am currently developing a multiclass classifier and am trying to replace nn.CrossEntropyLoss with my own loss function. The background is that I want to penalize some misclassifications more strongly than others. For example, with 4 classes, the following matrix describes the penalty:

0 , 1 , 4 , 5
1 , 0 , 3 , 8
4 , 3 , 0 , 2
5 , 8 , 2 , 0

As you can see, misclassifying a sample from class 1 as class 2 is not as bad as misclassifying it as class 4. In the first case my loss function should return 1, and in the second case 5. I started to implement the function as follows:

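import torch
import torch.nn as nn

class MyLoss(nn.Module):
    def __init__(self, penalty):
        super().__init__()
        # penalty: float tensor of shape (C, C), row = true class, column = predicted class
        self.penalty = penalty
    def forward(self, output, target):
        # argmax is not differentiable, so no gradient reaches `output`
        _, predicted = torch.max(output, dim=1)
        loss = []
        for i in range(len(predicted)):
            loss.append(self.penalty[target[i], predicted[i]])
        return torch.mean(torch.stack(loss))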

I figured out that the loss cannot be backpropagated properly, since torch.max returns class indices and so cuts the backpropagation graph. Does anybody have an idea how to replace torch.max with a differentiable function? I know that I can use softmax(output)[1] for a 2-class classification problem. Is there something similar for the n-class problem?

Thanks in advance!!

Hi roxor!

One approach is to use cross-entropy with probabilistic or soft targets.

This post has further discussion and details:


K. Frank

Hi KFrank,
thanks for your fast reply!
I read the post you mentioned and tried to implement it. The first thing I noticed was that my target tensor must now have the shape (Batchsize, C) instead of (Batchsize,). So instead of the ground-truth class index, I have to pass in the corresponding row of the distance matrix. Did I get that right?

Below is my first draft. I made some modifications to your code since, as far as I know, the usage of Variable is deprecated:

import torch

def softXEnt (input, target):
    logprobs = torch.nn.functional.log_softmax (input, dim = 1)
    return  -(target * logprobs).sum() / input.shape[0]

# input values are logits
# 4 possible classes and 2 samples 
inputBad = torch.tensor([[3,5,2.5,6], [7,5.5,2,0]])
inputGood = torch.tensor([[0,0,0,1], [1,0,0,0]], dtype=torch.float)

# ground truth is [4,1]
# Recap the distance matrix is:
# 0 , 1 , 4 , 5
# 1 , 0 , 3 , 8
# 4 , 3 , 0 , 2
# 5 , 8 , 2 , 0

# For softmax we need a float tensor
penalize_target = torch.tensor([[5 , 8 , 2 , 0], [0 , 1 , 4 , 5]], dtype=torch.float)
target = torch.nn.functional.softmax(penalize_target, dim = 1)

print("SoftXEnt result bad prediction")
print(softXEnt (inputBad, target))
print("SoftXEnt result good prediction")
print(softXEnt(inputGood, target))

And the output is:

SoftXEnt result bad prediction
SoftXEnt result good prediction

As we can see, the bad prediction is penalized more strongly than the good prediction of the classifier. However, shouldn’t the good prediction result in a tensor(0.0), since the prediction is equal to the ground truth?

Thanks! Looking forward to your opinion!

Hi roxor!

Your inputBad and inputGood don’t represent the predictions you
think they do. As your comment notes, they are logits. To make them
a little easier to understand, let’s turn them into probabilities with

>>> torch.nn.functional.softmax (inputBad, dim = 1)
tensor([[3.4387e-02, 2.5408e-01, 2.0857e-02, 6.9067e-01],
        [8.1249e-01, 1.8129e-01, 5.4745e-03, 7.4090e-04]])
>>> torch.nn.functional.softmax (inputGood, dim = 1)
tensor([[0.1749, 0.1749, 0.1749, 0.4754],
        [0.4754, 0.1749, 0.1749, 0.1749]])

First note that inputGood assigns your ground-truth label (class-4
and class-1 for your two samples, respectively) a probability of 47.5%
for both samples. Your ground-truth label does have the highest
probability, but it does not have a probability of 100%.

To repeat this, the input vector, [0, 0, 0, 1], does not mean a
probability of 100% for class-4. It is a vector of logits that translates
to a probability of 47.5% for class-4, so it is not a pure class-4
prediction.

Second, inputBad translates to ground-truth probabilities of 69.1%
and 81.2% for the first and second samples, respectively. So
inputBad actually gives stronger predictions for the ground truth
than does inputGood.
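
To make the logits point concrete, here is a minimal sketch of how large a
logit gap has to be before softmax approaches a one-hot prediction. The
tensors mild and sharp are illustrative values, not taken from your code:

```python
import torch

# Logits, not probabilities: a 1 in the last slot is only a mild preference.
mild = torch.tensor([[0.0, 0.0, 0.0, 1.0]])
# A much larger logit gap is needed for a near-certain class-4 prediction.
sharp = torch.tensor([[0.0, 0.0, 0.0, 20.0]])

p_mild = torch.nn.functional.softmax(mild, dim=1)
p_sharp = torch.nn.functional.softmax(sharp, dim=1)

print(p_mild)   # class-4 probability is about 0.475
print(p_sharp)  # class-4 probability is very close to 1
```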

Now let’s look at penalize_target:

>>> torch.nn.functional.softmax(penalize_target, dim = 1)
tensor([[4.7299e-02, 9.5003e-01, 2.3549e-03, 3.1870e-04],
        [4.8372e-03, 1.3149e-02, 2.6410e-01, 7.1791e-01]])

In both cases your target probability is smallest for your ground-truth
entry (class-4 and class-1, respectively). The problem is that a small
number in your distance matrix (small is “good,” and zero is “perfect”)
translates into a small probability – backwards of what you want.
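
One possible fix, sketched under the assumption that you still want to derive
soft targets from your distance matrix: negate the distances before the
softmax, so that a distance of zero maps to the largest target probability.
The temperature knob is a hypothetical parameter added for illustration; it
is not part of your code.

```python
import torch

# Rows of the distance matrix for ground-truth classes 4 and 1.
penalize_target = torch.tensor([[5.0, 8.0, 2.0, 0.0],
                                [0.0, 1.0, 4.0, 5.0]])

# Hypothetical knob: larger values spread more probability onto "near" classes.
temperature = 1.0

# Negating the distances makes small distance -> large probability.
target = torch.nn.functional.softmax(-penalize_target / temperature, dim=1)

print(target)
print(target.argmax(dim=1))  # indices 3 and 0, i.e. class-4 and class-1
```

With this change the ground-truth entry gets the largest target probability,
and classes with a large distance get the smallest.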

A couple of comments:

First, although you could reverse-engineer soft-cross-entropy to give
you a distance-matrix-like loss function by appropriately adjusting
your soft targets, you don’t really want a distance-matrix loss. You
really do want something like cross-entropy, with its logarithmic
divergences. In practice, it seems to train better.

Take a look at the numerical details of the class-4 / class-9 example
in the post I linked to. That shows concretely how to penalize certain
misclassifications more than others.


K. Frank

Hi KFrank,
thanks for the explanation about the logits and all the details about my code.

Actually, all I want is the distance-matrix loss.

I read the class-4 / class-9 example again. However, I don’t see how to apply it to my case, since my data are labeled correctly. I want to penalize the predictions based on the matrix, and I don’t understand how to do that with the approach you recommended in the example. How would you do this?


Hi roxor!

Two comments:

First, as I said initially, cross-entropy with soft targets is a way to penalize
some misclassifications differently from others. But it is not the same as
a distance-matrix loss.

Second, you should think hard about whether you really want to use a
distance-matrix loss. What you want in practice is a loss that causes
your network to train effectively and then give accurate predictions.
A cross-entropy-like loss is likely to work better in this regard. In any
event, you shouldn’t just assume that a distance-matrix loss will work
better. At a minimum, you should try both and see which gives better
results.

K. Frank