How to calculate class weights for imbalanced data

I am dealing with a binary classification problem where the data is imbalanced. I am trying to train the model with a weighted cross-entropy loss or a weighted focal loss. How can I calculate the weights for each class?
Suppose there are n0 examples of the negative class and n1 examples of the positive class. Currently I calculate the weight for each class as follows:

weight for negative class: 1 - n0/(n0+n1)
weight for positive class: 1 - n1/(n0+n1)

In this way the sum of the weights is normalised to 1. Is this the right way to calculate class weights, or is there a better way?
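For concreteness, here is a minimal sketch of how these weights could be plugged into a weighted cross-entropy loss. This assumes PyTorch and a two-logit output head; the counts are illustrative:

```python
import torch
import torch.nn as nn

# Class counts from the *training* set only, to avoid data leakage.
n0, n1 = 240, 38  # negative / positive examples (illustrative)

# Weights as above: each class is weighted by the frequency of the
# other class, so the minority class gets the larger weight and the
# two weights sum to 1.
w_neg = 1 - n0 / (n0 + n1)  # = n1 / (n0 + n1)
w_pos = 1 - n1 / (n0 + n1)  # = n0 / (n0 + n1)

# Weighted cross-entropy over two logits per example.
criterion = nn.CrossEntropyLoss(weight=torch.tensor([w_neg, w_pos]))

logits = torch.randn(8, 2)           # dummy batch of 8 examples
targets = torch.randint(0, 2, (8,))  # dummy 0/1 labels
loss = criterion(logits, targets)
```

For a single-logit binary setup, `nn.BCEWithLogitsLoss(pos_weight=torch.tensor([n0 / n1]))` expresses the same weighting ratio.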


This looks like the correct way to do it as long as it is computed on the training dataset (to avoid data leakage).

You can also simplify the expressions: there is no need for the 1 - ratio form, since 1 - n0/(n0+n1) = n1/(n0+n1) and 1 - n1/(n0+n1) = n0/(n0+n1). In other words, you can use n1 and n0 directly: each class is weighted by the other class's count, normalised by the total.

Thanks for your reply. May I use the validation dataset to calculate the class weights?

Essentially you want to make sure your split is stratified, meaning the ratio of positive labels is the same in the train, validation, and test sets.
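A minimal sketch of such a stratified split, assuming scikit-learn (the feature and label arrays here are dummies):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 16)                  # dummy features
y = (np.random.rand(1000) < 0.12).astype(int)  # ~12% positive labels

# Carve out the test set first, then split the rest into train/val.
# stratify= keeps the positive ratio (nearly) identical in every part.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(y_train.mean(), y_val.mean(), y_test.mean())  # all ~0.12
```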


Always use the train part and not the validation one, so that your evaluation isn't biased (or at least as little as possible). Indeed, if you estimate some parameters and then evaluate on the same data, you will get optimistic results.

Thanks for your reply. In my case I have multiple datasets: I will validate and test the model on one dataset and train it on the others. All of them are imbalanced, with negative examples in the majority, but the ratio of negatives differs from dataset to dataset. How can I deal with this case?


If the datasets are different, it might be better to train one model per dataset, otherwise it will be hard to generalize. If you go this route, you will have one negative-ratio estimate per dataset.

Otherwise, you will need additional work to make sure that the different datasets are “aligned”, so that the final trained model is good enough. If you choose this route, you can estimate a single negative ratio once the datasets have been aligned.
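A sketch of what the two routes look like in code, assuming plain NumPy label arrays (the dataset names and positive rates below are made up):

```python
import numpy as np

# 0/1 label arrays for the datasets used for training (made-up data).
train_labels = {
    "dataset_a": np.random.rand(5000) < 0.05,  # ~5% positives
    "dataset_b": np.random.rand(2000) < 0.20,  # ~20% positives
}

# Route 1: one model per dataset -> one weight pair per dataset.
per_dataset = {
    name: (y.mean(), 1.0 - y.mean())  # (w_neg, w_pos) = (n1/N, n0/N)
    for name, y in train_labels.items()
}

# Route 2: one model on the pooled ("aligned") training data
# -> a single weight pair estimated on the concatenation.
pooled = np.concatenate(list(train_labels.values()))
w_neg, w_pos = pooled.mean(), 1.0 - pooled.mean()
```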

I recently came across this other formula for binary cross-entropy class weights. Is it wrong?

Class 0 = (count_0 + count_1)/count_0 = (240+38)/240 = 1.158
Class 1 = (count_0 + count_1)/count_1 = (240+38)/38 = 7.316

As you see, the weights don’t sum up to 1.
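For what it's worth, this formula is scikit-learn's "balanced" heuristic up to a constant factor of the number of classes, so it can be cross-checked like this (assuming scikit-learn):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

counts = np.array([240, 38])    # count_0, count_1 from the post
manual = counts.sum() / counts  # [1.158..., 7.316...]

# scikit-learn computes n_samples / (n_classes * count_c), i.e. the
# same weights divided by 2 here, so the class ratio is identical.
y = np.repeat([0, 1], counts)
balanced = compute_class_weight(class_weight="balanced",
                                classes=np.array([0, 1]), y=y)
print(manual / balanced)  # [2. 2.]
```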

I think it does not depend on the absolute value of the weights, only on their ratio. In fact, your formula gives exactly the same ratio as the one discussed above (count_1 : count_0 in favour of the minority class), just scaled so the weights no longer sum to 1.
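To see why only the ratio matters, at least for PyTorch's weighted cross-entropy: with the default reduction='mean', the loss is divided by the sum of the per-example weights, so rescaling all weights by a constant cancels exactly. A quick check:

```python
import torch
import torch.nn as nn

logits = torch.randn(16, 2)
targets = torch.randint(0, 2, (16,))

w = torch.tensor([1.158, 7.316])  # the un-normalised weights above

loss_raw = nn.CrossEntropyLoss(weight=w)(logits, targets)
loss_norm = nn.CrossEntropyLoss(weight=w / w.sum())(logits, targets)

print(torch.allclose(loss_raw, loss_norm))  # True
```

With losses that divide by the element count instead (e.g. BCEWithLogitsLoss), a global rescaling scales the loss and gradients uniformly, which only interacts with the learning rate.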