How to approach an imbalanced image dataset for a mobile image classification app?

Hello, I’m a total newbie in image classification and I got really confused by all the available techniques for each step of implementing an image classification app - from data preparation to model deployment.
My idea is to learn by doing a real project. I chose to implement a mobile app that classifies an image from the phone camera. For mobile inference I want to test and decide between two pretrained models (transfer learning) - MobileNetV2 and YOLO11m-cls.
The problem is that my dataset is small and imbalanced. I have one class with 2567 images, another with 1180 images, another with 880, another with 192, etc., and the smallest class has 69 images. I’m really stuck figuring out how to approach this case - transfer learning on imbalanced multi-class image data for a mobile app.
Should I use only data augmentation in the data prep stage or combine it with some other method in another stage (training, evaluation)?
Or is data augmentation not suitable in this case, and should I use some other approach? I read about stratified split and stratified k-fold but I don’t understand whether I should apply only one of them, combine both, or combine one of them with something else. Or should I use a totally different approach?
I just can’t get a clear picture of how to do this and what to do in each stage.
Any advice appreciated, thank you!

Hi!

Most real-world applications have the issue of imbalanced data. There are a few things you can try to increase the performance of the model.

Class weighting
You can apply a weight to each class that is inversely proportional to the number of images of that class in the dataset. It is often better to also normalize these weights. This penalizes the model more when it misclassifies an image from a less common class, which pushes it to get these classes right as well.
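As a rough illustration of the weighting described above, here is a minimal sketch in plain Python. The function name, the class names "a" to "e" and the choice to normalize so that the weights average to 1 are assumptions made for the example; the five class sizes are the ones quoted in the question.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return {class: weight} with weight proportional to 1 / class count,
    normalized so that the weights average to 1 across classes."""
    counts = Counter(labels)
    raw = {cls: 1.0 / n for cls, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {cls: w * scale for cls, w in raw.items()}

# Illustrative labels built from the five class sizes quoted in the question
# (the class names "a".."e" are placeholders, not the real class names).
sizes = {"a": 2567, "b": 1180, "c": 880, "d": 192, "e": 69}
labels = [cls for cls, n in sizes.items() for _ in range(n)]
weights = inverse_frequency_weights(labels)
```

With these counts the smallest class ends up with a weight about 37 times larger than the biggest one (2567 / 69), so it is worth checking whether such a strong ratio destabilizes training.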

Data augmentation
You can try to “generate” more samples by creating augmented forms, like rotations, flipping, noise, etc. This will try to emulate you having more data, while this is actually not the case. However, with more complex models it is hard to “fool” them. Furthermore, the model might also take the easy way out then and just detect images with augmentation and images without instead of the actual visual appearance of your objects. Nonetheless, it is always good to apply some level of augmentation to your training images to get better model generalization.

Synthetic data
With the current state-of-the-art generative models, it is possible to generate high-quality synthetic images for the classes where your data is limited. You can either generate images from scratch or use your current images to generate new variants.

Custom loss
Often some classes are confused by the model more often than others. They may look similar, or you just do not have enough data for the model to find the distinction. In that case you can also create a custom loss function that penalizes the model more when it makes a classification mistake between those specific classes.
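One possible way to build such a loss, sketched in plain Python: ordinary cross-entropy plus an extra cost for probability mass placed on a class that is known to be confused with the true one. The penalty matrix, the `strength` parameter and the function names are all hypothetical choices for this example, not a standard recipe.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confusion_penalized_loss(logits, target, penalty, strength=1.0):
    """Cross-entropy plus an extra cost for probability placed on classes
    that are known to be confused with the target.

    penalty[t][j] is a hand-chosen cost (an assumption of this sketch) for
    predicting class j when the true class is t; the diagonal is 0.
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[target])
    extra = sum(penalty[target][j] * p for j, p in enumerate(probs))
    return cross_entropy + strength * extra

# Three classes; classes 0 and 1 are assumed to be the easily confused pair.
penalty = [
    [0.0, 2.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
logits_confused_pair = [1.0, 2.0, 0.0]  # true class 0, model leans toward class 1
logits_other_mistake = [1.0, 0.0, 2.0]  # true class 0, model leans toward class 2
loss_pair = confusion_penalized_loss(logits_confused_pair, 0, penalty)
loss_other = confusion_penalized_loss(logits_other_mistake, 0, penalty)
```

Both examples give the true class the same probability, so their plain cross-entropy is identical; only the mistake toward the penalized class is charged extra.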

Stratification can help to properly split the imbalanced dataset into a training and validation set, making sure that both splits have the same class distribution. You can also take a random subset, which is often sufficient as well, but then you have to check afterwards that the class distributions of the splits are OK (stratification does this for you). Finally, make sure that there is no data leakage between the splits, as this will give skewed performance metrics. This can be an issue with stratification if images of similar objects are next to each other; stratification often splits them across training and validation, which gives you data leakage.
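To make the idea concrete, here is a minimal stratified split in plain Python. It is only a sketch: the function name, the 70/15/15 default and the two-class example are assumptions, and in practice a library routine would normally be used instead.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return one list of sample indices per fraction, splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    splits = [[] for _ in fractions]
    for indices in by_class.values():
        rng.shuffle(indices)
        start = 0
        for i, frac in enumerate(fractions):
            # The last split takes the remainder so that no sample is dropped.
            if i == len(fractions) - 1:
                end = len(indices)
            else:
                end = start + round(frac * len(indices))
            splits[i].extend(indices[start:end])
            start = end
    return splits

# Two placeholder classes using the largest and smallest sizes from the question.
labels = ["big"] * 2567 + ["small"] * 69
train, val, test = stratified_split(labels)
```

Note that this sketch shuffles within each class and does nothing about the leakage problem mentioned above: near-duplicate images would still need to be grouped and kept in the same split.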

Thank you! If I use class weighting, is it OK to combine it with a custom loss? Or should I use only one of the proposed techniques?
Also if I:

  1. First calculate the class weights from the complete dataset and
  2. Then do a stratified train/val/test split,
  3. Then augment only the train classes,
  4. Apply the calculated class weights from 1 to train the model - is that ok, when the train data is already augmented?

Here’s how I’d structure your experiment:

  1. Start with stratified train-validation-test split (70-15-15)
  2. Apply balanced augmentation to training data only
  3. Implement class weights in your loss function
  4. Train both MobileNetV2 and YOLO11m-cls with the same setup
  5. Use stratified 5-fold CV on your training set for hyperparameter tuning
  6. Evaluate on your held-out test set using multiple metrics
  7. Test inference speed on actual mobile device
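Step 5 can be sketched as follows in plain Python. The fold assignment here is a simple round-robin per class, which is an assumption of this example rather than a specific library's algorithm; the point is only that the folds are built from the training indices, so the held-out test set from step 1 is never touched.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_indices, val_indices) pairs; each fold keeps roughly the class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

# Placeholder labels for the training portion only (sizes are illustrative).
train_labels = ["big"] * 1797 + ["small"] * 48
folds = list(stratified_kfold(train_labels, k=5))
```

Each of the five validation folds then contains 9 or 10 images of the small class, which also shows how noisy per-class metrics will be at this dataset size.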

For your mobile app, also consider:

  • Model quantization after training to reduce size
  • Test-time augmentation (TTA) for better predictions at slight computational cost
  • Confidence thresholds - maybe refuse to classify if confidence is too low
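The confidence-threshold idea could look roughly like this. The 0.6 threshold, the function name and the class names are placeholders; a sensible threshold would have to be picked on the validation set.

```python
import math

def classify_with_rejection(logits, class_names, threshold=0.6):
    """Return (label, confidence); label is None when the top probability is below threshold."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    label = class_names[best] if confidence >= threshold else None
    return label, confidence

names = ["class_0", "class_1", "class_2"]  # placeholder class names
confident = classify_with_rejection([4.0, 0.5, 0.2], names)
uncertain = classify_with_rejection([1.0, 0.9, 0.8], names)
```

In the app, a `None` label would translate into something like asking the user to retake the photo instead of showing a guess.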

Remember, with your class having only 69 images, you’re pushing the limits of what’s possible. If performance on minority classes remains poor, you might need to either collect more data for those classes or consider merging similar minority classes if semantically appropriate.

The class weighting can be independently applied from the augmentations. It is just a matter of how large the penalty is for a model to misclassify that class. It does not matter if the data has been augmented or not.

As long as the distribution of classes of the full dataset and training dataset is mostly equal, it should be fine to use the same class weights.

@Hamza_Javaid also added some nice steps that complement mine!

However, I would start simple, for instance with only class weights, and work your way up by adding more complexity over time, like a custom loss function, to see if it improves your performance. You also don’t need a custom loss yet if you just use class weights; this is already implemented in the standard CrossEntropyLoss of PyTorch (see the CrossEntropyLoss page in the PyTorch 2.7 documentation).
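In PyTorch the class weights go into the `weight` argument of `torch.nn.CrossEntropyLoss`. To show what that changes, here is a plain Python sketch of the weighted mean as I understand it from the documentation for the default `mean` reduction; the function name and the toy numbers are made up for the example.

```python
import math

def weighted_cross_entropy(batch_logits, targets, class_weights):
    """Weighted mean of per-sample cross-entropy.

    Each sample's loss is scaled by the weight of its true class, and the sum
    is divided by the total weight of the samples in the batch.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, target in zip(batch_logits, targets):
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
        sample_loss = log_norm - logits[target]  # equals -log softmax(logits)[target]
        weighted_sum += class_weights[target] * sample_loss
        weight_total += class_weights[target]
    return weighted_sum / weight_total

# Two samples with identical logits: the first (class 0) is predicted correctly,
# the second (rare class 1) is predicted wrongly.
batch_logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]
uniform = weighted_cross_entropy(batch_logits, targets, [1.0, 1.0])
weighted = weighted_cross_entropy(batch_logits, targets, [0.2, 1.8])
```

With the larger weight on class 1, the mistake on the rare class dominates the batch loss, which is exactly the "larger penalty" effect described earlier in the thread.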

Thank you very much for the clear and structured advice. That makes things a lot clearer and answers some of the questions that I had. For example, one of the things I wondered was whether, after a stratified dataset split, I can also apply stratified k-fold cross-validation (points 1 and 5 of your advice suggest that this combination is fine).
Thank you again!

Thank you for your clear and descriptive advice! I can see a light at the end of the tunnel now.

Hello again,

I’m posting an update on my progress so far.

Initially I decided to use all my classes in my dataset, even those that have <100 images.

I did a stratified train/val/test split - 70/15/15%. The training set is resized to 240x240, augmented with rotation, horizontal flip and sharpness adjustment, and normalized. The validation and test sets are only resized. The dataloaders are created with batch size 32, with shuffling only for the training set. I froze all the MobileNetV2 layers but changed the classifier to output only 14 classes. I used a weighted cross-entropy loss function, calculating the class weights in a simple way: 1 / number of images per class. For accuracy I calculated balanced accuracy because I read that it is suitable for imbalanced datasets. I ran 50, 100, 150 and 200 epochs and my validation metrics are a complete disaster. F1 score is around 0.2-0.3, precision and recall also hover around these values, and the Matthews correlation coefficient is also 0.2-0.3.

Then I decided to get rid of the classes that have <100 images without changing anything else and the results are still very bad.

Epoch: 50 | train_loss: 0.0642 | train_acc: 0.9796 | Epoch: 50 Val Loss: 2.2054 Val Balanced Accuracy: 0.0065 Val Precision: 0.4638 Val Recall: 0.3656 Val F1 Score: 0.3111 Val Mathews corcoef: 0.3455
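One thing that may be worth checking in the log above: balanced accuracy is usually defined as the mean of per-class recall, so a value of 0.0065 next to a recall of 0.3656 could point to a problem in how the metric is accumulated rather than in the model itself. A minimal reference implementation to compare against, assuming integer class labels (the function name and toy data are illustrative):

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# A classifier that always predicts the majority class of a 3-class problem.
y_true = [0] * 8 + [1] * 1 + [2] * 1
y_pred = [0] * 10
score = balanced_accuracy(y_true, y_pred)
```

In this toy case plain accuracy is 0.8 while balanced accuracy is one third, i.e. 1 / number of classes, which is the value to expect from a model that only ever predicts one class.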

I’m not sure what to do. I read about synthetic data creation with variational autoencoders but I’m not sure if it will help. Another thing is to unfreeze some of the layers but I’m not sure which ones and do I have to do additional changes like adding custom layers.

Also, my images are mixed - some of them are taken in a natural environment and others in a laboratory. The laboratory ones show a single vine leaf over a uniform background; the natural ones show either leaves only or a mix of grapes and leaves. Is this a bad dataset?

I’m sharing a link to the file that I use to experiment with MobileNet: Mobilenetv2_experiment. If anyone has some insight or advice on what to change, I’d appreciate it.

There are still many things you can try:

  • You can investigate on which samples the model has trouble making the right decision. Maybe something is wrong with the labels?
  • Furthermore, visualizing the latent space is also a nice way to spot weird outliers or incorrect labels. See t-SNE or UMAP which can be used to reduce the dimensionality of the image data to be able to plot it on a 2D scatter plot.
  • If all is well with the data and labels (you might still have a high imbalance), you can see if it is due to the model not being able to learn enough. Though, if you already get bad performance with a simple model, using a more complex model might not help. Maybe try to train your current setup on a known dataset like MNIST and compare to the benchmarks to see if your training setup is right? You can find some nice architectures here: Models - Hugging Face. I wanted to link to paperswithcode.com, but unfortunately it has been taken offline. They had a nice leaderboard for different applications and datasets.
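For the first point, a quick way to see where the model struggles is to count which (true class, predicted class) mistakes occur most often on the validation set. A small sketch, with a made-up function name and toy labels:

```python
from collections import Counter

def most_confused_pairs(y_true, y_pred, top=3):
    """Return the most frequent (true_class, predicted_class) mistakes with their counts."""
    mistakes = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return mistakes.most_common(top)

# Toy validation labels and predictions with placeholder class names.
y_true = ["a", "a", "a", "b", "b", "c", "c", "c"]
y_pred = ["a", "b", "b", "b", "a", "c", "c", "a"]
pairs = most_confused_pairs(y_true, y_pred)
```

Looking at the actual images behind the top pair is often the fastest way to spot wrong labels or classes that are visually too close to separate.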

Hi! I hope I’m not too late to ask in this thread. You said that

The class weighting can be independently applied from the augmentations. It is just a matter of how large the penalty is for a model to misclassify that class. It does not matter if the data has been augmented or not.

and

As long as the distribution of classes of the full dataset and training dataset is mostly equal, it should be fine to use the same class weights.

I’m facing the same problem as @myrmel4e83: in the train split itself I have one class with 1500 samples, while the smallest one has only 73 samples. I applied targeted augmentation only to the smallest classes and generated around 200 samples, because:

  1. If I generate many samples with extreme augmentation, I’m afraid they will no longer be representative of the class
  2. If I generate many samples with either small or medium augmentation, I’m afraid they will differ too little and the model will overfit on that class

I’m pretty sure targeted augmentation will change the distribution compared to the original train split (and maybe even the full dataset distribution).

In this case, which train split would you recommend for calculating the class weights: the one before augmentation or the one after augmentation?

Thanks in advance!