I have an imbalanced dataset for a classification task. There are 7 classes for output.
Each proportion is around 10%, 3%, 6%, 39%, 16%, 6%, 20%.
I use cross-entropy loss and I have tried a weighted loss with weights [2, 8.5, 3.5, 4, 1.5, 4, 8], but it does not help.
The loss curves clearly show overfitting: I can reach around 99% accuracy on training but stay stuck at 70% on validation.
What else could I do that would really help?
Hi, are you using accuracy as your metric? If so, please use the F1-score instead and check it on both the training and validation sets. Accuracy is not the right metric for imbalanced datasets. Hope this helps.
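To make that concrete, here is a minimal pure-Python sketch of macro-averaged F1 (the labels below are made up for illustration); in practice you would likely just call `sklearn.metrics.f1_score(y_true, y_pred, average='macro')`:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 averaged with equal weight per class,
    so minority classes count as much as the majority class."""
    f1s = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# A model that always predicts the majority class gets high accuracy
# but poor macro F1 -- exactly the failure mode on imbalanced data.
y_true = [3] * 8 + [1, 5]        # 80% majority class
y_pred = [3] * 10                # always predict the majority
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                  # 0.8 -- looks fine
print(macro_f1(y_true, y_pred))  # ~0.30 -- reveals the problem
```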
You can use the Python imgaug library to augment your dataset however you want. Also, you can pass a weight parameter to your loss function, and that will change your loss curve. Please let me know if it resolves your problem.
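On the weights point, a common recipe is inverse-frequency class weights. Here is a sketch using the class proportions from the original post; the normalization (mean weight of 1) is one convention among several, and the final line shows where the weights would go in e.g. PyTorch:

```python
# Class proportions from the question (7 classes).
proportions = [0.10, 0.03, 0.06, 0.39, 0.16, 0.06, 0.20]

# Inverse-frequency weights: rare classes get larger weights.
raw = [1.0 / p for p in proportions]

# Normalize so the weights average to 1, keeping the loss scale comparable.
mean = sum(raw) / len(raw)
weights = [w / mean for w in raw]

print([round(w, 2) for w in weights])
# In PyTorch these would be passed to the loss (not run here):
# criterion = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights))
```

Note how the 3% class ends up with the largest weight and the 39% class with the smallest, which is the behavior you want from a class-balanced loss.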
Hi. I understand that the F1-score would make the metric more meaningful. However, would it actually help my neural network's training?
Yes, it will. As for imgaug, it depends on which augmentations you apply to the images, and they may not work well on your dataset. As for the weights: if you don't pass the weights to the loss, the loss value doesn't reflect the real (class-balanced) error, and the model won't learn the minority classes. Have you tried it?
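Besides reweighting the loss, another common option is to oversample minority classes so each batch is roughly class-balanced. Here is a sketch using only Python's standard library with fabricated labels matching the proportions above; PyTorch users would typically reach for `torch.utils.data.WeightedRandomSampler` instead:

```python
import random
from collections import Counter

random.seed(0)

# Fake imbalanced labels: class 1 is rare (~3%), class 3 dominates (~39%).
labels = [0]*100 + [1]*30 + [2]*60 + [3]*390 + [4]*160 + [5]*60 + [6]*200

# Give each sample a weight inversely proportional to its class frequency,
# so every class is drawn with (roughly) equal probability.
freq = Counter(labels)
sample_weights = [1.0 / freq[y] for y in labels]

resampled = random.choices(labels, weights=sample_weights, k=10_000)
counts = Counter(resampled)
print({c: counts[c] / 10_000 for c in sorted(counts)})
# Each class now appears at roughly 1/7 (~14%) instead of its raw proportion.
```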