Correct way to set cross entropy weights for an unbalanced dataset?

In a classification problem, if I have an unbalanced train dataset, let’s say 40 elements of class 1 and 60 elements of class 2, what should I pass as the weight parameter of nn.CrossEntropyLoss so that the imbalance is accounted for and training behaves as if the dataset were balanced?

Is weights = [1/40, 1/60] the correct answer?

Hi Hugo!

Yes, this would be a fine starting point.
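For concreteness, here is a minimal sketch of what that looks like in code (the batch size and random tensors are just illustrative):

```python
import torch
import torch.nn as nn

# Class counts from your example: 40 of class 0, 60 of class 1
counts = torch.tensor([40.0, 60.0])

# Inverse-frequency weights, i.e. [1/40, 1/60], as you proposed
weights = 1.0 / counts

criterion = nn.CrossEntropyLoss(weight=weights)

# Illustrative shapes: a batch of 8 samples, 2 classes
logits = torch.randn(8, 2)            # raw, unnormalized scores
targets = torch.randint(0, 2, (8,))   # integer class labels in {0, 1}
loss = criterion(logits, targets)
```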

A few comments:

40 / 60 is not very imbalanced. I probably wouldn’t bother with class
weights.

Your results should be largely insensitive to the exact values of your class
weights – if they’re not, you should consider the possibility that there might
be something unstable about your model or training procedure.
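One concrete form of this insensitivity you can verify directly: with the default reduction = 'mean', CrossEntropyLoss normalizes the weighted per-sample losses by the sum of the weights of the targets, so rescaling all class weights by a common constant leaves the loss unchanged. A quick check (random tensors, purely illustrative):

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 2)
targets = torch.randint(0, 2, (8,))

w = torch.tensor([1 / 40, 1 / 60])
loss_a = nn.CrossEntropyLoss(weight=w)(logits, targets)
# Same ratio, rescaled by 120 to [3.0, 2.0]
loss_b = nn.CrossEntropyLoss(weight=120.0 * w)(logits, targets)

# With reduction = 'mean' the normalization cancels the common factor
print(torch.allclose(loss_a, loss_b))   # True
```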

If you only have two classes, you might choose to structure your problem
as a binary problem (rather than a multiclass problem that happens to
have two classes) and use BCEWithLogitsLoss and pos_weight, rather
than CrossEntropyLoss and weight. It doesn’t really matter much – the
two approaches are very similar.
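If you do go the binary route, here is a sketch of the pos_weight version, assuming (this choice is arbitrary) that class 1 (40 elements) is treated as the positive class. pos_weight is then the ratio of negative to positive counts, 60 / 40 = 1.5:

```python
import torch
import torch.nn as nn

n_pos, n_neg = 40, 60   # class 1 as "positive", class 2 as "negative"

# pos_weight scales the loss contribution of the positive class
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([n_neg / n_pos]))

# Illustrative shapes: one logit per sample, float 0/1 targets
logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
loss = criterion(logits, targets)
```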

Best.

K. Frank