Calculating loss for mapped logits

I’d like to add a second loss term for a mapped class category (will explain) and I’m having trouble finding the appropriate way to do so.

My model produces logits for a multi-class classification problem, and I have no issues using binary_cross_entropy_with_logits on them as normal. I’d like to add a second loss term based upon another level in the class hierarchy. So for each class c_i there’s a c_i → p_i mapping to a parent class, and similarly for the ground truth, gt_c_i → gt_p_i (where |P| != |C|, naturally). So functionally, I would like to map the logits to a distribution over P, and then call cross_entropy on that to produce the second loss term to sum. How would I do that? I’ve tried quite a few variations and I feel like I’m missing something obvious.

Thank you in advance!

Hi George!

Could you clarify what this first level of your use case is?

To me, “multi-class classification” means that each data sample belongs to
exactly one of some number of classes. That is, for example, a given image
is an image of exactly one animal, say, “bat,” “bear,” or “dog” (but is not in more
than one class at the same time).

If this is the case, your model should predict (for this three-class example)
three unnormalized log-probabilities (these log-probabilities are technically
not logits), and you would use CrossEntropyLoss as your loss criterion.

Does this correctly describe your use case, or are you doing something
different?

So, to extend my earlier example, you might also have images of “crow,”
“pigeon,” and “robin.” You would then like to group your six classes into
two superclasses:

mammals: bat, bear, dog
birds: crow, pigeon, robin

Is this what you mean by “mapped class category?”

Given that logits (appropriate for BCEWithLogitsLoss) are not the same as
unnormalized log-probabilities (appropriate for CrossEntropyLoss), there is
not really a natural mapping from class-logits to superclass-log-probabilities.
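
As a quick illustration of the mismatch (the numbers below are made up):
per-class sigmoids of BCE-style logits are independent of one another, so
they need not sum to one, and summing them within a superclass need not
give a valid probability.

import torch

logits = torch.tensor ([2.0, 1.0, -0.5, 0.5, -1.0, 1.5])   # six made-up subclass logits
probs = torch.sigmoid (logits)                              # independent per-class "probabilities"
print (probs.sum())              # about 3.7 -- not a distribution over the six classes
print (probs[[0, 1, 3]].sum())   # about 2.2 -- the "mammal" sum already exceeds one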

First, could you clarify whether you actually mean logits here?

Second, what is your goal with adding a cross-entropy loss at your superclass
level?

Third, is it your intention to use your superclass loss function *instead* of a
subclass loss function, or do you want to combine the two together into a single
loss function?

Best.

K. Frank

The thorough response is greatly appreciated @KFrank!

Yes, you understand precisely: agreed on both the multi-class nature of the problem and on the characterization of the superclasses as what I intended by “mapped class category”.

I was being a bit loose with my terminology, but I am using F.binary_cross_entropy_with_logits on the logits directly (without issue) currently.

The goal of adding the secondary loss at the superclass level is to penalize errors at that level. The actual formulation will be somewhat different from straight BCE, but for these purposes let’s presume it is straight BCE, and that the intent is simply to sum the losses from both levels to arrive at the loss to backprop.

Which is the crux of the question (which I communicated poorly, so thanks for your patience!): given that I know the appropriate weighting from each class to its secondary class and am comfortable generating a per-class loss at the secondary level, how would I practically go about calculating and applying it? And more specifically, what’s the right way to compute it so that I can pass the gradient back?

I was setting up a sparse NxM matrix W, where there are N elements in C and M elements in P, and W_i_j is non-zero only if c_i is a child of p_j. Then I should be able to take the logits (or, perhaps more naturally, the probabilities after a sigmoid), multiply by W, and then run BCEWithLogitsLoss (or CrossEntropyLoss) against the ground truth P. If those are all tensors, can I simply add that term to the loss and call backward() on the result? (It didn’t seem so, but I might have another issue. I really should inspect the gradients more carefully on a toy example.)
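
Concretely, here’s a rough toy sketch of what I was attempting (the sizes, the mapping, and the batch are all made up just to illustrate):

import torch
import torch.nn.functional as F

# N = 6 subclasses, M = 2 superclasses; W[i, j] is non-zero iff c_i is a child of p_j
N, M = 6, 2
W = torch.zeros(N, M)
W[[0, 1, 3], 1] = 1.0
W[[2, 4, 5], 0] = 1.0

logits = torch.randn(4, N, requires_grad=True)   # stand-in for the model output (batch of 4)
gt_c = torch.randint(0, 2, (4, N)).float()       # multi-label subclass ground truth
gt_p = (gt_c @ W).clamp(max=1.0)                 # superclass ground truth derived via W

loss_c = F.binary_cross_entropy_with_logits(logits, gt_c)       # the existing subclass loss
loss_p = F.binary_cross_entropy_with_logits(logits @ W, gt_p)   # mapped "logits" -- the part I'm unsure about

loss = loss_c + loss_p   # can I just sum the two terms and call backward() here?
loss.backward()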

(Alternatively, I thought about having another layer with fixed weights and attempting to backprop through that, but that seemed problematic as well and I was likely missing something more fundamental.)

Appreciate any insight. In the meantime, I’ll take another pass at this and post a more complete example here shortly if the sketch above isn’t clear.

Thanks again!

Hi George!

You keep talking as if this is a (single-label) multi-class classification problem
(rather than a multi-label, multi-class problem).

But you state that you are using binary_cross_entropy_with_logits(),
which is not appropriate for the single-label case (but is the correct choice
for a multi-label problem).

(Note, I explained what a (single-label) “multi-class classification” problem
is in my first post.)

Before we can usefully address how (or whether) to add a superclass loss
term to your overall loss, you will have to clarify whether you are performing
a single-label or a multi-label classification.

Best.

K. Frank

Yes, I am treating it as a multi-label, multi-class problem. (In practice there is typically only one class, but since multiple classes can occur in some datasets, I structure it as multi-label. But if the single-label case is much clearer here, we could certainly proceed as if that were the case and I can adapt from there.)

Thanks again for your precision!

Hi George!

Regardless of whether yours is a multi-label problem or not, probably your
best approach is to compute your loss just at the subclass level and not add
a superclass term to it. If you train to predict the correct subclasses (which
is what you should be trying to do), you will also automatically be training to
predict the correct superclasses. You haven’t offered any explanation of why
you need or want to add a superclass term.

(You’ve told us more-or-less nothing about your concrete use case,* so it’s
hard to offer useful advice as to what approach might be best.)

Having said that, let me answer your technical question in the single-label
context.

Your network will predict unnormalized log-probabilities for your subclasses.
These will typically be the output of your final Linear layer (without any
subsequent non-linear activations). You would normally pass these directly
into CrossEntropyLoss.
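
For example (just a sketch; the layer sizes, batch, and labels below are
placeholders, not anything from your actual model):

import torch

model = torch.nn.Sequential (          # stand-in network; the final Linear has one
    torch.nn.Linear (16, 32),          # output per subclass and no activation after it
    torch.nn.ReLU(),
    torch.nn.Linear (32, 6),           # six subclasses
)
criterion = torch.nn.CrossEntropyLoss()

x = torch.randn (4, 16)                # made-up batch of four inputs
target = torch.randint (6, (4,))       # made-up integer subclass labels
loss = criterion (model (x), target)   # unnormalized log-probabilities go straight in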

Conceptually, you want to combine your subclass probabilities into superclass
probabilities that you would then pass into a superclass cross-entropy loss
function.

For numerical reasons (essentially the same reasons that CrossEntropyLoss
takes log-probabilities rather than plain probabilities), you should do all of this
in log-space – that is, always work with log-probabilities without ever explicitly
converting them to probabilities – combining your subclass log-probability
predictions into superclass log-probabilities (that you would then pass into a
superclass-level CrossEntropyLoss).

First pass the unnormalized subclass log-probabilities through log_softmax()
to convert them to normalized log-probabilities.

At this point, conceptually, you would use exp() to obtain subclass probabilities,
you would sum the subclass probabilities within a given superclass to obtain the
probability for that superclass, and then call log() to obtain that superclass’s
log-probability.

But we wish to perform this manipulation in log-space, so, instead, we will select
the subclass log-probabilities for the subclasses of a given superclass and “add”
them together with logsumexp() to obtain that superclass’s log-probability
“directly.”

Consider:

>>> import torch
>>> print (torch.__version__)
2.1.0
>>>
>>> _ = torch.manual_seed (2023)
>>>
>>> # example with six subclasses and two superclasses
>>>
>>> #  subclass       superclass
>>>
>>> #  0: bat         1: mammal
>>> #  1: bear        1: mammal
>>> #  2: crow        0: bird
>>> #  3: dog         1: mammal
>>> #  4: pigeon      0: bird
>>> #  5: robin       0: bird
>>>
>>> class_map = torch.tensor ([                       # membership of subclass (column) in superclass (row)
...     [False, False,  True, False,  True,  True],
...     [ True,  True, False,  True, False, False]
... ])
>>>
>>> subu = torch.randn (6)                            # unnormalized subclass log-probabilities
>>> subn = subu.log_softmax (dim = 0)                 # normalized subclass log-probabilities
>>>
>>> subn
tensor([-3.0564, -1.2996, -2.2345, -1.1579, -2.5913, -1.6918])
>>> subn.exp()
tensor([0.0471, 0.2727, 0.1070, 0.3141, 0.0749, 0.1842])
>>>
>>> supern = torch.empty (class_map.size (0))         # storage for superclass (normalized) log-probabilities
>>> for  i in range (class_map.size (0)):             # compute log-probability for each superclass
...     supern[i] = subn[class_map[i]].logsumexp (dim = 0)
...
>>> supern
tensor([-1.0047, -0.4559])
>>> supern.exp()
tensor([0.3661, 0.6339])
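
And, to sketch how this would plug into a combined loss that you can
backpropagate through (the batch size and targets below are made up;
nll_loss() is the natural choice at the superclass level because supern
is already normalized):

import torch
import torch.nn.functional as F

# same six-subclass / two-superclass mapping as above
class_map = torch.tensor ([
    [False, False,  True, False,  True,  True],   # superclass 0: bird
    [ True,  True, False,  True, False, False],   # superclass 1: mammal
])
sub_to_super = class_map.float().argmax (dim = 0)   # subclass index -> superclass index

preds = torch.randn (4, 6, requires_grad = True)    # stand-in for your model's output (batch of 4)
sub_targets = torch.randint (6, (4,))               # made-up subclass labels
super_targets = sub_to_super[sub_targets]           # superclass labels derived from the mapping

loss_sub = F.cross_entropy (preds, sub_targets)     # ordinary subclass loss

subn = preds.log_softmax (dim = 1)                  # normalized subclass log-probabilities
supern = torch.stack ([                             # superclass log-probabilities, per sample
    subn[:, class_map[i]].logsumexp (dim = 1) for i in range (class_map.size (0))
], dim = 1)
loss_super = F.nll_loss (supern, super_targets)     # supern is already normalized

loss = loss_sub + loss_super                        # sum the two levels
loss.backward()                                     # gradients flow back through both terms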

*) For example, what is the input data to your model? Images? Time series?
Sets of disparate descriptive values? And what does that data mean? What
are your classes? Is your training data balanced or unbalanced? How much
training data do you have? What are your most important performance metrics?

Best.

K. Frank