How to calculate class weights for imbalanced data

I am dealing with a binary classification problem where the data is imbalanced. I am trying to train the model with a weighted cross-entropy loss or a weighted focal loss. How can I calculate the weights for each class?
Suppose there are n0 examples of the negative class and n1 examples of the positive class. Currently I calculate the weight for each class as follows:

weight for negative class: 1 - n0/(n0+n1)
weight for positive class: 1 - n1/(n0+n1)

In this way the sum of the weights is normalised to 1. Is this the right way to calculate class weights, or is there a better way?
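For concreteness, here is a minimal sketch of how these weights could be plugged into a weighted cross-entropy loss. This assumes PyTorch and a two-logit output head; the counts are illustrative:

```python
import torch
import torch.nn as nn

# Class counts from the *training* set only, to avoid data leakage.
n0, n1 = 240, 38  # negative / positive examples (illustrative)

# Weights as above: each class is weighted by the frequency of the
# other class, so the minority class gets the larger weight and the
# two weights sum to 1.
w_neg = 1 - n0 / (n0 + n1)  # = n1 / (n0 + n1)
w_pos = 1 - n1 / (n0 + n1)  # = n0 / (n0 + n1)

# Weighted cross-entropy over two logits per example.
criterion = nn.CrossEntropyLoss(weight=torch.tensor([w_neg, w_pos]))

logits = torch.randn(8, 2)           # dummy batch of 8 examples
targets = torch.randint(0, 2, (8,))  # dummy 0/1 labels
loss = criterion(logits, targets)
```

For a single-logit binary setup, `nn.BCEWithLogitsLoss(pos_weight=torch.tensor([n0 / n1]))` expresses the same weighting ratio.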


This looks like the correct way to do it as long as it is computed on the training dataset (to avoid data leakage).

You can also simplify the expressions: there is no need for the 1 - ratio form, since 1 - n0/(n0+n1) = n1/(n0+n1) and 1 - n1/(n0+n1) = n0/(n0+n1). In other words, you can use n1 and n0 directly: each class is weighted by the other class's count, normalised by the total.

Thanks for your reply. May I use the validation dataset to calculate the class weights?

Essentially you want to make sure your split is stratified, meaning the ratio of positive labels is the same in the train, validation, and test sets.
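A minimal sketch of such a stratified split, assuming scikit-learn (the feature and label arrays here are dummies):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 16)                  # dummy features
y = (np.random.rand(1000) < 0.12).astype(int)  # ~12% positive labels

# Carve out the test set first, then split the rest into train/val.
# stratify= keeps the positive ratio (nearly) identical in every part.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(y_train.mean(), y_val.mean(), y_test.mean())  # all ~0.12
```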


Always use the train part and not the validation one, so that your evaluation isn't biased (or at least as little as possible). Indeed, if you estimate some parameters and then evaluate on the same data, you will get optimistic results.

Thanks for your reply. In my case I have multiple datasets: I will validate and test the model on one dataset and train it on the others. All of them are imbalanced, with negative examples in the majority, but the ratio of negatives differs from dataset to dataset. How can I deal with this case?


If the datasets are different, it might be better to train one model per dataset, otherwise it will be hard to generalize. If you go this route, you will have one negative-ratio estimate per dataset.

Otherwise, you will need additional work to make sure that the different datasets are “aligned”, so that the final trained model is good enough. If you choose this route, you can estimate a single negative ratio once the datasets have been aligned.
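A sketch of what the two routes look like in code, assuming plain NumPy label arrays (the dataset names and positive rates below are made up):

```python
import numpy as np

# 0/1 label arrays for the datasets used for training (made-up data).
train_labels = {
    "dataset_a": np.random.rand(5000) < 0.05,  # ~5% positives
    "dataset_b": np.random.rand(2000) < 0.20,  # ~20% positives
}

# Route 1: one model per dataset -> one weight pair per dataset.
per_dataset = {
    name: (y.mean(), 1.0 - y.mean())  # (w_neg, w_pos) = (n1/N, n0/N)
    for name, y in train_labels.items()
}

# Route 2: one model on the pooled ("aligned") training data
# -> a single weight pair estimated on the concatenation.
pooled = np.concatenate(list(train_labels.values()))
w_neg, w_pos = pooled.mean(), 1.0 - pooled.mean()
```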

I recently came across this other formula for binary cross-entropy class weights. Is it wrong?

Class 0 = (count_0 + count_1)/count_0 = (240+38)/240 = 1.158
Class 1 = (count_0 + count_1)/count_1 = (240+38)/38 = 7.316

As you see, the weights don’t sum up to 1.
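For what it's worth, this formula is scikit-learn's "balanced" heuristic up to a constant factor of the number of classes, so it can be cross-checked like this (assuming scikit-learn):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

counts = np.array([240, 38])    # count_0, count_1 from the post
manual = counts.sum() / counts  # [1.158..., 7.316...]

# scikit-learn computes n_samples / (n_classes * count_c), i.e. the
# same weights divided by 2 here, so the class ratio is identical.
y = np.repeat([0, 1], counts)
balanced = compute_class_weight(class_weight="balanced",
                                classes=np.array([0, 1]), y=y)
print(manual / balanced)  # [2. 2.]
```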

I think it does not depend on the absolute value of the weights, only on their ratio. In fact, your formula gives exactly the same ratio as the one discussed above (count_1 : count_0 in favour of the minority class), just scaled so the weights no longer sum to 1.
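To see why only the ratio matters, at least for PyTorch's weighted cross-entropy: with the default reduction='mean', the loss is divided by the sum of the per-example weights, so rescaling all weights by a constant cancels exactly. A quick check:

```python
import torch
import torch.nn as nn

logits = torch.randn(16, 2)
targets = torch.randint(0, 2, (16,))

w = torch.tensor([1.158, 7.316])  # the un-normalised weights above

loss_raw = nn.CrossEntropyLoss(weight=w)(logits, targets)
loss_norm = nn.CrossEntropyLoss(weight=w / w.sum())(logits, targets)

print(torch.allclose(loss_raw, loss_norm))  # True
```

With losses that divide by the element count instead (e.g. BCEWithLogitsLoss), a global rescaling scales the loss and gradients uniformly, which only interacts with the learning rate.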