Handle an imbalanced dataset

I have an imbalanced dataset for a classification task. There are 7 output classes.

The class proportions are around 10%, 3%, 6%, 39%, 16%, 6%, 20%.

I use cross-entropy loss and I have tried a weighted loss with weights [2, 8.5, 3.5, 4, 1.5, 4, 8], but it does not help.
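
One common way to choose loss weights is to make them inversely proportional to class frequency. The following is a minimal sketch in plain Python, assuming the proportions quoted above; the helper names `inverse_frequency_weights` and `weighted_cross_entropy` are hypothetical and do not come from any library.

```python
import math

# Class proportions quoted in the question (they sum to 1.0).
proportions = [0.10, 0.03, 0.06, 0.39, 0.16, 0.06, 0.20]


def inverse_frequency_weights(props):
    """Weight each class by 1/frequency, rescaled so the weights average to 1."""
    raw = [1.0 / p for p in props]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


def weighted_cross_entropy(probs, target, weights):
    """Weighted negative log-likelihood for a single sample.

    probs: predicted class probabilities; target: index of the true class.
    """
    return -weights[target] * math.log(probs[target])


weights = inverse_frequency_weights(proportions)
for cls, (p, w) in enumerate(zip(proportions, weights)):
    print(f"class {cls}: proportion={p:.2f} weight={w:.2f}")
```

Under this scheme the rarest class (3%) gets the largest weight and the most common class (39%) gets the smallest. The weights listed in the question do not follow that ordering (for example, the 39% class gets 4 while the 10% class gets 2), so it may be worth double-checking that each weight lines up with the intended class index.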

The loss curves clearly show overfitting: I can reach around 99% training accuracy, but validation accuracy stays at about 70%.

What else could I do that would really help?

Hi, are you using accuracy as your metric? If so, please use F1-score instead and check the scores on both the train and validation sets, since accuracy is not the right metric for imbalanced datasets. Hope this helps.
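
To illustrate why accuracy can mislead on imbalanced data, here is a small sketch of macro-averaged F1 in plain Python. The function `macro_f1` and the toy labels are made up for illustration; in practice a library implementation would normally be used.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # Assumption: a class with no true and no predicted samples scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


# Toy labels: 8 samples of class 0, 2 of class 1; the model always predicts 0.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = macro_f1(y_true, y_pred, num_classes=2)
print(f"accuracy={accuracy:.2f} macro_f1={f1:.2f}")  # accuracy=0.80 macro_f1=0.44
```

The model that ignores the minority class still scores 80% accuracy, while macro F1 drops to about 0.44 because the missed class contributes a score of 0.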

You can use the Python imgaug library to augment your dataset however you want. Also, you can use the weight parameter of your loss function, which will affect your loss curve. Please let me know if it resolves your problem.

Hi. I understand that F1-score would be a more meaningful metric. However, would it actually help my neural network’s training?

Yes, it will. For imgaug, it depends on which augmentations you apply to the images, and it may not work well on your dataset. As for the weights: if you don’t pass class weights to the loss, the loss value is dominated by the majority classes, so it doesn’t really capture the error on the rare classes and the model may learn very little about them. Have you tried?