Custom loss function for NLP multiclass classification problem

I need help/advice/an example regarding the approach to developing a PyTorch custom loss function for NLP multiclass classification.
The dataset looks something like this:
text1 ‘AC’
text2 ‘AD’
text3 ‘BC’
text4 ‘BC’
text5 ‘BD’
…the rest of the dataset…

Labels ‘AB’ or ‘CD’ are impossible from the business perspective and will not appear in the dataset. The first part of the label is always ‘A’ or ‘B’ and the second is always ‘C’ or ‘D’. So, there are only four possible classes: ‘AC’, ‘AD’, ‘BC’, ‘BD’.

I need multiclass classification into four classes, ‘AC’, ‘AD’, ‘BC’, ‘BD’, but I need a loss function that will include and learn the relation between the components A-B/C-D. For example, the penalization factor for a wrong prediction of the first label component should be alpha, and for the second, beta. It would be great to learn these parameters alpha and beta. Of course, if someone has a better suggestion instead of these parameters, I would follow it.

For example, to simplify, let these labels represent sentiment-analysis labels. Component A means emotional, B means unemotional, C means positive, and D means negative (expression in the text).

What would be the best way to do that? Are there steps I should take in feature engineering, or is it possible to implement this just through the loss function (preferable)? Some code snippets are more than welcome.

Thank you

Hi Ninoslav!

You should understand your problem to be a multi-label, multi-class
problem in the following sense:

It is multi-class because you have multiple classes – in your case,
two classes, “emotion” and “positivity.” And it is multi-label because
each sample is given not just one class label (i.e., “emotion” or
“positivity,” but not both) but rather multiple labels, potentially one for
each of your two classes.

That is, each of your samples can be labelled with neither, one, or
both of your two classes. Said in another (but equivalent) way, each
sample carries two binary labels: first binary label – “no-emotion”
vs. “yes-emotion”; and second binary label – “no-positivity” vs.
“yes-positivity.”

(It is true that you can structure this as a single-label, four-class problem
with the four labels being your ‘AC’, ‘AD’, ‘BC’, ‘BD’, but it is generally
better to structure this kind of problem as a multi-label problem.)

The preferred loss function will be BCEWithLogitsLoss. Note that this
loss function explicitly supports “the case of multi-label classification.”

As noted above, you’ll be better off with two classes, but use the
multi-label structure. Just to be clear, your “A-B” is the first of two
binary labels and “C-D” is the second.
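For concreteness, here is one way the four string labels could be encoded as a pair of binary targets – the A-vs.-B choice first, the C-vs.-D choice second. This particular encoding is a sketch, not the only possible choice:

```python
import torch

# One possible encoding (an assumption, not the only choice): the first
# binary target is 1 when the first component is 'A' ("emotion"), and the
# second is 1 when the second component is 'C' ("positivity").
def encode(label):
    return [1.0 if label[0] == 'A' else 0.0,
            1.0 if label[1] == 'C' else 0.0]

labels = ['AC', 'AD', 'BC', 'BD']
targets = torch.tensor([encode(lbl) for lbl in labels])
# targets is a float tensor of shape [4, 2]:
# [[1., 1.], [1., 0.], [0., 1.], [0., 0.]]
```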

Here is one benefit of using BCEWithLogitsLoss with the multi-label
approach. Use BCEWithLogitsLoss's weight constructor-argument
with a tensor, [[alpha, beta]] (of shape [1, 2]). (The leading
singleton dimension of weight will be broadcast along the batch
dimension, so that you’ll be using the same class weights for each
sample in the batch.) This will weight the (binary-cross-entropy) contribution
of your “emotion” class by alpha in your loss function and weight
“positivity” with beta.
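A minimal sketch of this setup (the alpha and beta values here are placeholders; you would choose them to express your preference):

```python
import torch

alpha, beta = 2.0, 1.0   # placeholder weights expressing how much you care
                         # about "emotion" vs. "positivity"
loss_fn = torch.nn.BCEWithLogitsLoss(weight=torch.tensor([[alpha, beta]]))

logits  = torch.randn(8, 2)                    # model output: one logit per binary label
targets = torch.randint(0, 2, (8, 2)).float()  # two binary labels per sample
loss = loss_fn(logits, targets)                # scalar; the [1, 2] weight broadcasts
                                               # across the batch dimension
```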

I don’t think learning alpha and beta really makes sense.

Maybe you care more about getting “emotion” right than “positivity.”
Or maybe the other way around. I don’t see how your network can
know this without you telling it. You tell the network how much you
care about “emotion” vs. “positivity” with your alpha and beta
parameters, so the network doesn’t have any way to “learn” this
preference of yours.

Yes, as outlined above, using just a loss function – specifically the
multi-label case of BCEWithLogitsLoss – it is possible – and likely
the best way – to implement your classifier.

(Just to be sure, I used your two-class example – “emotion” and
“positivity” – for simplicity and to follow along with your post. But
there is nothing about this scheme that is restricted to two classes.
You could, for example, label your natural-language phrases with five
features, say “emotion,” “positivity,” “certainty,” “conciseness,” and
“understandability,” and use BCEWithLogitsLoss to build a multi-label,
five-class classifier.)

Good luck.

K. Frank

Hi Frank,

Thank you for such a complete reply and feedback. I appreciate it.

I’m aware that we can approach the problem as a multi-label, multi-class one. I am trying to treat it as a multi-class problem (for the initial set of four classes, ‘AC’, ‘AD’, ‘BC’, ‘BD’) but would like to include the ‘impact’ of the class components. Do you have any ideas in that direction?

Hi Ninoslav!

I have a few comments:

First, to reiterate, since you want to “look inside” your classes, drilling
down into what you call the “class components,” it seems unnatural to
me to treat this as a single-label, four-class problem rather than a
multi-label, two-class problem. By using your four classes, you’ve
“hidden” the components that you want access to, just making things harder for yourself.

Now some thoughts about your approach:

The most common approach for a (single-label) four-class problem
would be to have the final Linear layer of your model output four
raw-score logits that you then pass into CrossEntropyLoss. (The
ground-truth target passed into CrossEntropyLoss will be a single
integer class label (for each sample in the batch) that takes on the
values {0, 1, 2, 3}.)

You can convert the logits into probabilities for your four classes by
passing them through softmax(). (CrossEntropyLoss has, in effect,
softmax() built in.)
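A minimal sketch of this conventional four-class setup:

```python
import torch

logits = torch.randn(8, 4)           # final Linear layer output: [batch, 4] raw scores
target = torch.randint(0, 4, (8,))   # integer class labels in {0, 1, 2, 3}

loss = torch.nn.CrossEntropyLoss()(logits, target)   # softmax is, in effect, built in

probs = torch.softmax(logits, dim=1)  # explicit per-class probabilities; rows sum to 1
```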

Note, the predictions coming out of your model are logits (that can be
converted to probabilities) specifically because you trained them to be
logits by using CrossEntropyLoss as the loss criterion you backpropagate.
Were you to use some other loss criterion for training, the meaning of
your model outputs would likely change.

Nonetheless, let’s work under the assumption that your model outputs
remain logits even as we modify the loss criterion.

So: Convert your model outputs to probabilities, and compute the
probabilities for your class “components”:

P(A) = P(AC) + P(AD)
P(B) = P(BC) + P(BD)
P(C) = P(AC) + P(BC)
P(D) = P(AD) + P(BD)

Note that P(A) + P(B) = 1 (which is consistent because the
first component is either in state A or state B), and, similarly,
P(C) + P(D) = 1.
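In code, and assuming the class ordering 0 = ‘AC’, 1 = ‘AD’, 2 = ‘BC’, 3 = ‘BD’ (an assumption for illustration), this marginalization looks like:

```python
import torch

# Assumed class ordering: 0 = 'AC', 1 = 'AD', 2 = 'BC', 3 = 'BD'
probs = torch.softmax(torch.randn(8, 4), dim=1)   # four-class probabilities, [batch, 4]

p_A = probs[:, 0] + probs[:, 1]   # P(A) = P(AC) + P(AD)
p_B = probs[:, 2] + probs[:, 3]   # P(B) = P(BC) + P(BD)
p_C = probs[:, 0] + probs[:, 2]   # P(C) = P(AC) + P(BC)
p_D = probs[:, 1] + probs[:, 3]   # P(D) = P(AD) + P(BD)

# consistency checks: P(A) + P(B) = 1 and P(C) + P(D) = 1
assert torch.allclose(p_A + p_B, torch.ones(8))
assert torch.allclose(p_C + p_D, torch.ones(8))
```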

Now consider your ground-truth target: target = 1, for example,
means class AD, so 100% probability of the first component being
in state A (so 0% probability in state B) and 0% probability of the
second component being in state C (so 100% probability in state D).

You now have both the predicted and ground-truth probabilities of your
first component being in state A so you can use the mismatch between
these two probabilities (for example, the cross entropy) as your loss
criterion for the first component of your class, and weight it with alpha.

Similarly, you could use the mismatch between the predicted and
ground-truth probabilities for your second component being in state
C as your second-component loss criterion, and weight it with beta.

You can now train your model with this combined alpha and beta
loss criterion. Problem solved!
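Putting the pieces together, a sketch of such a combined loss criterion might look like the following (again assuming the class ordering 0 = ‘AC’, 1 = ‘AD’, 2 = ‘BC’, 3 = ‘BD’; the eps term is just numerical padding for the logs):

```python
import torch

def component_loss(logits, target, alpha=1.0, beta=1.0, eps=1e-8):
    # Assumed class ordering: 0 = 'AC', 1 = 'AD', 2 = 'BC', 3 = 'BD',
    # so target < 2 means the first component is A, and an even target
    # means the second component is C.
    probs = torch.softmax(logits, dim=1)
    p_A = probs[:, 0] + probs[:, 1]        # predicted P(first component is A)
    p_C = probs[:, 0] + probs[:, 2]        # predicted P(second component is C)

    t_A = (target < 2).float()             # ground truth: 1 for 'AC'/'AD', 0 for 'BC'/'BD'
    t_C = (target % 2 == 0).float()        # ground truth: 1 for 'AC'/'BC', 0 for 'AD'/'BD'

    # per-component binary cross entropy, weighted by alpha and beta
    bce_A = -(t_A * torch.log(p_A + eps) + (1 - t_A) * torch.log(1 - p_A + eps))
    bce_C = -(t_C * torch.log(p_C + eps) + (1 - t_C) * torch.log(1 - p_C + eps))
    return (alpha * bce_A + beta * bce_C).mean()

loss = component_loss(torch.randn(8, 4), torch.randint(0, 4, (8,)),
                      alpha=2.0, beta=1.0)
```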

However …

What you’ve actually done here is built a (weighted) multi-label
binary-cross-entropy loss criterion for the case where your multi-label
predictions are encoded in an unusual way.

How can the four outputs of the final Linear layer of your model turn
into multi-label predictions (with an unusual encoding) instead of being
the four conventional single-label predictions for your four classes?

Because that’s what you’ve trained them to be. (The first rule of neural
networks is that the output of your model means what you train it to mean.)

There’s nothing wrong nor inconsistent about doing things this way
(although it does seem like you would be standing on your head), but
the meaning of your model’s predictions is no longer the same as
if you had trained it as a conventional single-label, four-class model.
Instead, your model is making multi-label, two-class predictions, albeit
encoded in an unusual way.


K. Frank

Hi Frank,
Thank you for your replies and suggestions. Regarding the following part:

There’s nothing wrong nor inconsistent about doing things this way
(although it does seem like you would be standing on your head), but
the meaning of your model’s predictions is no longer the same as
if you had trained it as a conventional single-label, four-class model.
Instead, your model is making multi-label, two-class predictions, albeit
encoded in an unusual way.

Isn’t it true that when I consider both ‘components’, I should push the model to train for predicting both components – in other words, a four-class model – this way through the ‘components’?

Thank you

Hi Ninoslav!

Yes, you can look at it this way, and you can do it this way.

The details of training will be different, but there is an equivalence
between a multi-label, two-class problem and a single-label,
four-class problem.

Let’s say that in the first instance your classes are A and B, and in
the second, 0, 1, 2, 3.


   A--no, B--no  -->  0
   A-yes, B--no  -->  1
   A--no, B-yes  -->  2
   A-yes, B-yes  -->  3

gives an explicit mapping between the two problems.
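The table above amounts to a two-bit binary encoding, which can be written out as a pair of tiny conversion helpers:

```python
# With a and b in {0, 1} indicating "yes" for classes A and B,
# the single-label class index is just a + 2 * b.
def to_single_label(a, b):
    return a + 2 * b

def to_multi_label(idx):
    return idx % 2, idx // 2

assert to_single_label(1, 0) == 1     # A-yes, B--no  -->  1
assert to_multi_label(2) == (0, 1)    # 2  -->  A--no, B-yes
```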

My intuition tells me you will get better training and better results treating
this as a multi-label, two-class problem. But you should also be able to
get things to work treating this as a single-label, four-class problem.

Your call.


K. Frank

Hi @KFrank,

Thank you.
I didn’t quite understand your explanation in this part:

Let’s say that in the first instance, your classes are A and B, and in
the second, 0, 1, 2, 3.
A--no, B--no  -->  0
A-yes, B--no  -->  1
A--no, B-yes  -->  2
A-yes, B-yes  -->  3

Could you be so kind as to explain it a little bit further?
Also, do you have any suggestion regarding constructing the custom loss in this case? We should convert new calculated probabilities into logits again using F.sigmoid; am I right?
Thank you