I am dealing with a binary classification problem where the data is imbalanced. I am trying to train the model with a weighted cross-entropy loss or a weighted focal loss; how can I calculate the weight for each class?
Suppose there are n0 examples of the negative class and n1 examples of the positive class. Currently I calculate the weight for each class as follows:
weight for negative class: 1 - n0/(n0+n1)
weight for positive class: 1 - n1/(n0+n1)
This way the sum of the weights is normalised to 1, and the minority (positive) class gets the larger weight. Is this the right way to calculate class weights, or is there a better approach?
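For concreteness, here is a minimal sketch (with hypothetical counts) comparing the complement weights above with inverse-frequency "balanced" weights, the scheme behind scikit-learn's `class_weight='balanced'`:

```python
# Sketch with assumed counts: two common class-weighting schemes
# for an imbalanced binary problem with n0 negatives and n1 positives.

def complement_weights(n0: int, n1: int) -> tuple[float, float]:
    """w_c = 1 - n_c / N: the minority class gets the larger weight,
    and the two weights sum to 1."""
    total = n0 + n1
    return 1 - n0 / total, 1 - n1 / total

def balanced_weights(n0: int, n1: int) -> tuple[float, float]:
    """w_c = N / (2 * n_c): inverse-frequency weights, as used by
    scikit-learn's class_weight='balanced'. They average to 1 rather
    than summing to 1, which keeps the overall loss scale comparable."""
    total = n0 + n1
    return total / (2 * n0), total / (2 * n1)

# Hypothetical dataset: 900 negatives, 100 positives.
w_neg, w_pos = complement_weights(900, 100)   # ≈ 0.1 and 0.9
b_neg, b_pos = balanced_weights(900, 100)     # ≈ 0.556 and 5.0
```

Note that both schemes give the same *ratio* between the two weights (9:1 here); only the overall scale differs, which mainly interacts with the learning rate.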
Essentially you want to make sure your dataset is stratified, meaning the train, validation, and test sets all contain the same proportion of positive labels (so the class ratio is the same in all three sets).
Always estimate the class weights on the training split, not the validation one, so that your evaluation isn't biased (or at least as little biased as possible). Indeed, if you estimate parameters on some data and then evaluate on that same data, you will get optimistic results.
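As a sketch (assuming plain Python label lists; the function name is illustrative), a stratified split samples each class separately, and the weights are then computed from the training split only:

```python
import random

def stratified_split(labels, train=0.8, val=0.1, seed=0):
    """Return (train, val, test) index lists, splitting each class
    separately so every split preserves the original class ratio."""
    rng = random.Random(seed)
    splits = ([], [], [])
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_tr = int(len(idx) * train)
        n_va = int(len(idx) * val)
        splits[0].extend(idx[:n_tr])
        splits[1].extend(idx[n_tr:n_tr + n_va])
        splits[2].extend(idx[n_tr + n_va:])
    return splits

labels = [1] * 100 + [0] * 900           # hypothetical imbalanced labels
tr, va, te = stratified_split(labels)
n1 = sum(labels[i] for i in tr)          # positives in the train split only
n0 = len(tr) - n1
w_neg, w_pos = 1 - n0 / len(tr), 1 - n1 / len(tr)   # ≈ 0.1 and 0.9
```

The validation and test splits never enter the weight calculation; they are only used to measure performance.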
Thanks for your reply. In my case I have multiple datasets: I will train the model on some of them and validate and test it on another. All the datasets are imbalanced, with negative examples in the majority, but the ratio of negatives differs from dataset to dataset. How should I handle this case?
If the datasets are genuinely different, it might be better to train one model per dataset; otherwise it will be hard to generalize. If you go this route, you will have one negative-ratio estimate per dataset.
Otherwise, you will need additional work to make sure the different datasets are "aligned" so that the final trained model is good enough. If you choose this route, you can estimate a single negative ratio once the datasets have been aligned.
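The two routes can be contrasted in a small sketch (the counts below are hypothetical):

```python
# Hypothetical (n_negative, n_positive) counts for three datasets.
datasets = {
    "A": (9000, 1000),
    "B": (4500, 500),
    "C": (800, 200),
}

# Route 1: one model per dataset -> one negative ratio per dataset.
per_dataset = {name: n0 / (n0 + n1) for name, (n0, n1) in datasets.items()}
# -> {'A': 0.9, 'B': 0.9, 'C': 0.8}

# Route 2: align and pool the datasets -> a single negative ratio.
n0 = sum(counts[0] for counts in datasets.values())
n1 = sum(counts[1] for counts in datasets.values())
pooled = n0 / (n0 + n1)      # ≈ 0.894 for these counts
```

With route 2, note that the pooled ratio is dominated by the largest datasets, so a small dataset with a very different ratio (like "C" here) has little influence on the weights.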