Multi-Classification and CrossEntropyLoss - interplay of *weight* and *ignore_index*?

Hello there.
First of all, I'm having a great time playing around with PyTorch - thanks for that :)

Now to my question.
I'm building a CNN for sequential data using CrossEntropyLoss. I have 3 output classes, of which two are of interest and the last is used if neither of the two was a fit:

  • Class 0: no fit
  • Class 1: interest 1
  • Class 2: interest 2

All three classes are mutually exclusive, and my labeled training data is highly unbalanced, with occurrences ordered as Class 0 >> Class 1 > Class 2. To compensate, I'm computing weights for the weight argument of CrossEntropyLoss like this:

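import torch

def weights(labels, classes):  # classes = 3
    # Count occurrences per class; histc needs a float tensor and a fixed range
    hist = torch.histc(labels.float(), bins=classes, min=0, max=classes - 1)
    m = hist.min()
    # Rarest class gets weight 1; more frequent classes get smaller weights
    w = torch.tensor([v if v == 0 else m / v for v in hist])
    return w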

After reading the documentation of CrossEntropyLoss again, I stumbled over the ignore_index parameter, which I hadn't noticed before. I was wondering whether it would help me, but I'm not quite sure how to use it in conjunction with the weights, or whether to use it at all.

Would it make sense to set the ignore_index parameter to my Class 0?
If so, do I have to adapt my weight computation to ignore Class 0?

Hello Anyere!

Unless there is something unusual about your classification
problem, I think you should not use ignore_index for Class 0.

Suppose you feed a Class-0 sample to your model (so target
is 0), and your model incorrectly classifies it as Class 2. You
do want to tell your model not to do this, so you do want this
incorrect classification to add a penalty to your loss function.

If you set ignore_index = 0, this sample (and all other Class-0
samples) will not contribute to the loss function, and so Class-0
misclassifications will not be penalized.
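A minimal sketch of this effect, assuming a made-up mini-batch of four samples (the tensors below are hypothetical, not from the original post). With ignore_index = 0, making the predictions for the Class-0 samples arbitrarily bad leaves the loss unchanged:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical mini-batch: 4 samples, 3 classes; the first two targets are Class 0.
logits = torch.randn(4, 3)
targets = torch.tensor([0, 0, 1, 2])

criterion = nn.CrossEntropyLoss(ignore_index=0)
loss_before = criterion(logits, targets)

# Make the model badly wrong on the Class-0 samples: push them toward Class 2.
bad_logits = logits.clone()
bad_logits[:2] = torch.tensor([-10.0, 0.0, 10.0])
loss_after = criterion(bad_logits, targets)

# Both values match: samples whose target is 0 are skipped entirely.
print(loss_before.item(), loss_after.item())

# Without ignore_index, the same mistake raises the loss.
plain = nn.CrossEntropyLoss()
plain_before = plain(logits, targets)
plain_after = plain(bad_logits, targets)
print(plain_before.item(), plain_after.item())
```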

(Because your training data is unbalanced, you may indeed want
to deweight the Class-0 samples in your training set – either by
using CrossEntropyLoss’s weight argument, or by sampling
fewer of your Class-0 samples. But ignoring them altogether goes
too far and would be hiding important information from your model’s
training.)

Good luck!

K. Frank

Thank you for your answer, @KFrank! This was actually my gut feeling, but I wanted some other opinions.

One further question:
I am recomputing the weights for each minibatch. Does the weight computation shown above properly represent the occurrences of my labels for the CrossEntropyLoss?

Hi Anyere!

With one correction, your weight computation looks correct.

Note that in the case that hist.min() == 0 (perfectly possible), all
of your weights will be zero. I would eliminate m = hist.min(), and
go with 1.0 / v (in place of m / v).
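To illustrate the zero-count case, here is a small sketch using a hypothetical batch in which Class 2 never occurs. The helper names are made up for this example, and torch.bincount is used in place of torch.histc purely for convenience with integer labels:

```python
import torch

def weights_min_over_count(hist):
    # Scheme from the question: m / count, where m is the smallest count.
    m = hist.min()
    return torch.tensor([0.0 if v == 0 else float(m / v) for v in hist])

def weights_inverse_count(hist):
    # Suggested alternative: 1 / count, with 0 for classes absent from the batch.
    return torch.tensor([0.0 if v == 0 else float(1.0 / v) for v in hist])

# Hypothetical batch in which Class 2 never occurs.
labels = torch.tensor([0, 0, 0, 0, 1, 1])
hist = torch.bincount(labels, minlength=3).float()  # counts: 4, 2, 0

w_min = weights_min_over_count(hist)   # m is 0 here, so every weight becomes 0
w_inv = weights_inverse_count(hist)    # Class 0 -> 0.25, Class 1 -> 0.5, Class 2 -> 0
print(w_min, w_inv)
```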

A couple of comments:

I don’t think that the overall normalization of the weights matters. I
think that if you use reduction = 'mean' in your CrossEntropyLoss
(the default), the overall scale factor of your weights will drop out. (I.e.,
something like weight = 123.4 * weight won’t have any effect.)
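This can be checked numerically. The sketch below uses made-up logits, targets, and weights; with reduction = 'mean' the weighted loss is divided by the sum of the weights of the targets in the batch, so a common factor cancels, whereas with reduction = 'sum' it does not:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical batch and class weights, for illustration only.
logits = torch.randn(8, 3)
targets = torch.tensor([0, 0, 0, 0, 0, 1, 1, 2])
weight = torch.tensor([0.2, 0.5, 1.0])

# reduction='mean' (the default): rescaling the weights leaves the loss unchanged.
mean_a = nn.CrossEntropyLoss(weight=weight)(logits, targets)
mean_b = nn.CrossEntropyLoss(weight=123.4 * weight)(logits, targets)
print(torch.allclose(mean_a, mean_b))

# reduction='sum': the loss scales along with the weights.
sum_a = nn.CrossEntropyLoss(weight=weight, reduction='sum')(logits, targets)
sum_b = nn.CrossEntropyLoss(weight=123.4 * weight, reduction='sum')(logits, targets)
print(sum_a.item(), sum_b.item())
```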

I would probably just calculate the weights for the entire training set,
and not bother doing it on a per-batch basis. (I think both approaches
are reasonable.) As you take multiple steps in your gradient-descent
optimization algorithm, you are, in a sense, averaging over multiple
batches, so there is nothing magic about reweighting individual batches.

Last, using weight-of-class = 1 / number-of-samples-in-class
is perfectly reasonable, and I think somewhat standard. But other
choices are okay. You might experiment with only partially reweighting
your classes.
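As a rough sketch of both suggestions, the weights can be computed once from the labels of the whole training set and then reused for every batch. The function name, the `power` parameter, and the label counts below are assumptions made for this example; raising the counts to a power between 0 and 1 is just one possible way to reweight only partially:

```python
import torch
import torch.nn as nn

def class_weights(all_labels, num_classes, power=1.0):
    # power=1.0 gives full inverse-frequency weights; power=0.0 gives no reweighting.
    counts = torch.bincount(all_labels, minlength=num_classes).float()
    w = torch.zeros(num_classes)
    present = counts > 0                      # leave absent classes at weight 0
    w[present] = counts[present].pow(-power)
    return w

# Hypothetical training-set labels with Class 0 >> Class 1 > Class 2.
all_labels = torch.cat([
    torch.zeros(900), torch.ones(70), torch.full((30,), 2.0)
]).long()

full = class_weights(all_labels, 3, power=1.0)     # 1 / count
partial = class_weights(all_labels, 3, power=0.5)  # 1 / sqrt(count)

# Computed once, then reused for every batch.
criterion = nn.CrossEntropyLoss(weight=partial)
```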

Good luck!

K. Frank

Hey Frank

I think hist.min() can't be zero since my labels are exclusive. There always has to be exactly one true class out of the three. As for the normalisation, I just chose 1 because it is the most common value in those situations.

I was wondering about that too and decided to recalculate on a per-batch basis for those cases in which one class does not appear in my sequence at all. As far as I understand, this will make the true meaning of this sequence clearer and more granular for the backpropagation.

As a side note, when I’m doing my evaluation on test data I reset all weights to 1, since this would be the normal use case of the model in the wild - this makes sense, right?

Here is one follow-up question :)
During training I have sequences of variable length. To avoid unnecessary zero-padding within my CNN, I want to feed each sequence as a whole to my model. But for backpropagation I thought it would make sense to compute the losses on a fixed window, to make the loss computation independent of the incoming sequence length. So I came up with this routine (it's a simplified version that ignores the weight computation):

inputs, labels = sequence_data
outputs = model(inputs)

for l in range(0, len(labels), window_size):
    r = l + window_size
    loss = criterion(outputs[l:r], labels[l:r])
    optimizer.zero_grad()
    retain_graph = r < len(labels)
    loss.backward(retain_graph=retain_graph)
    optimizer.step()

For this to work I had to use the retain_graph parameter of backward(), and I found the documentation on this feature a bit sparse. Am I using it correctly, and could someone explain in more detail what is happening there and what data will be retained?
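For context: by default, backward() frees the intermediate tensors that autograd saved during the forward pass, so the same graph cannot be backpropagated through a second time; retain_graph=True keeps them alive. One caveat with the routine above is that calling optimizer.step() between backward passes modifies the parameters in place, which recent PyTorch versions may reject on the next backward through the retained graph. The following is only a sketch of one possible variant, not the thread's solution: it accumulates the per-window gradients and steps once per sequence. The model, data, and window size are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the model, data, and window size in the question.
model = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 3))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
window_size = 4

inputs = torch.randn(10, 5)              # one sequence of length 10
labels = torch.randint(0, 3, (10,))

outputs = model(inputs)                  # single forward pass builds one graph
optimizer.zero_grad()
num_windows = 0
for l in range(0, len(labels), window_size):
    r = l + window_size
    loss = criterion(outputs[l:r], labels[l:r])
    # Keep the saved tensors alive until the last window has been processed.
    loss.backward(retain_graph=r < len(labels))  # gradients accumulate in .grad
    num_windows += 1
optimizer.step()                         # one update from the accumulated gradients
```

Because every window's gradients are summed before the update, longer sequences contribute more windows to a single step; dividing each window loss by the number of windows would be one way to compensate.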

I checked the code again and realised that this situation could happen, which I had not foreseen. Thanks for the heads-up, @KFrank! But since the following is still true

I fixed this issue by removing all zeros before taking the minimum: m = hist[hist != 0].min()
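A self-contained sketch of that fix, assuming integer labels and at least one label per batch. Here torch.bincount and torch.where replace the original histc call and list comprehension for brevity; that swap is a choice made for this example, not part of the post:

```python
import torch

def weights(labels, classes):
    hist = torch.bincount(labels, minlength=classes).float()
    m = hist[hist != 0].min()            # smallest non-zero count
    # Absent classes keep weight 0; the rarest present class gets weight 1.
    return torch.where(hist == 0, torch.zeros_like(hist), m / hist)

# Hypothetical batch in which Class 2 never occurs.
labels = torch.tensor([0, 0, 0, 0, 1, 1])
w = weights(labels, 3)                   # Class 0 -> 0.5, Class 1 -> 1.0, Class 2 -> 0.0
print(w)
```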

This article got me covered:
https://towardsdatascience.com/pytorch-autograd-understanding-the-heart-of-pytorchs-magic-2686cd94ec95