A question on data balancing and splitting for segmentation

sebbecht · December 24, 2019, 8:27am

Dear Everyone,

I have created and labeled a small segmentation dataset with 5 classes, as a first proof of concept and to understand the training workflow before scaling up. With this small set I realize there are some balance issues, and splitting may be tricky. Classes 1 to 5 contain 100, 100, 100, 100, and 37 labeled images respectively; a planning error meant I didn't have time to take more images for class 5.

Each image contains only a single class. The total number of objects and the average number of objects per image are as follows:
Class: 1 2 3 4 5
Total: 62119 18958 66478 5238 17851
Avg: 627 187 651 50 525

As for data balancing: I could easily create more images for class 5, for which I did not manage to collect 100 images. Would it be reasonable to subsample classes 1 and 3 to bring their total number of objects closer to the others? Would I still have enough objects then?
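To make the subsampling idea concrete, here is a minimal sketch of what it could look like: randomly pick images from one class until their combined object count reaches a chosen budget. The function name `subsample_to_object_budget`, the budget of 18000 objects, and the flat per-image count of 627 are all assumptions for illustration; real per-image counts would vary.

```python
import random


def subsample_to_object_budget(objects_per_image, budget, seed=0):
    """Randomly pick image indices until their object counts sum to at least `budget`.

    `objects_per_image` is a list with one object count per image (hypothetical input).
    Returns the chosen indices and the total number of objects they contain.
    """
    rng = random.Random(seed)
    order = list(range(len(objects_per_image)))
    rng.shuffle(order)  # random order so the subsample is unbiased

    chosen, total = [], 0
    for idx in order:
        if total >= budget:
            break
        chosen.append(idx)
        total += objects_per_image[idx]
    return chosen, total


# Assumption: class 1 has 100 images with roughly 627 objects each.
class1_counts = [627] * 100
chosen, total = subsample_to_object_budget(class1_counts, budget=18000)
print(len(chosen), total)
```

With these assumed numbers, about 29 of the 100 images would already bring class 1 to roughly the object count of classes 2 and 5, which is part of why I am unsure whether discarding that much data is sensible.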

As for data splitting: would a random 0.2 split do, or should I randomly sample each class separately to ensure an even distribution? Will the uneven number of objects per image have negative implications?
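For reference, a per-class (stratified) split could be sketched as follows, using only the standard library. The function name `stratified_split` and the label list are assumptions for illustration; the image counts are the ones given above. Note that this balances images per class, not objects per image.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Split image indices into train/val so each class contributes ~val_fraction to val."""
    rng = random.Random(seed)

    # Group image indices by their (single) class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # At least one validation image per class, otherwise the rounded fraction.
        n_val = max(1, round(len(indices) * val_fraction))
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(val)


# One label per image: 100 images each for classes 1-4 and 37 for class 5.
labels = [1] * 100 + [2] * 100 + [3] * 100 + [4] * 100 + [5] * 37
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

A plain random 0.2 split over all 437 images would give class 5 about 7 validation images on average, but with noticeable variation from run to run; splitting per class fixes that number. If scikit-learn is available, `train_test_split(..., test_size=0.2, stratify=labels)` should give a comparable result.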

I realize that this may not be 100% relevant to the PyTorch forums, but it seemed like a really good place to start, with a lot of helpful people (and of course I'm using PyTorch).

Thank you and Merry Christmas!