Hello All,
I have a dataset with 7 classes and about 10,000 images, and the class distribution is imbalanced.
I am planning to use PNAS-Large as my base network.
What strategies should I follow? What percentage of the parameters should be frozen?
Thanks,
Regards,
Milton
Assuming more data (1 in above) is out of the picture, my go-tos for imbalanced datasets are stratified sampling (3 in above) and weighted loss (6 in above). See WeightedRandomSampler (forums) and the cross-entropy loss weight parameter, respectively.
Weighted loss is a little easier to implement, so that's usually where I start. Stratification is touchy: weighting so every class is drawn evenly often doesn't generalize well, and finding the sweet spot can be a pain, especially when you have several (or really just more than 2) classes.
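As a minimal sketch of the weighted-loss approach: pass per-class weights to `nn.CrossEntropyLoss` so that errors on rare classes cost more. The class counts below are made up for illustration; in practice you would count them from your own training set.

```python
import torch
import torch.nn as nn

# Hypothetical per-class sample counts for a 7-class imbalanced dataset.
class_sample_count = torch.tensor([500., 3000., 1200., 800., 2500., 1500., 500.])

# Inverse-frequency weights, normalized so they sum to the number of classes.
weights = 1.0 / class_sample_count
weights = weights / weights.sum() * len(class_sample_count)

# Rare classes now contribute more to the loss than frequent ones.
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, 7)            # a batch of 4 predictions over 7 classes
targets = torch.tensor([0, 1, 2, 3])  # ground-truth labels for the batch
loss = criterion(logits, targets)
```

Fully uniform weighting (as above) is the usual starting point; if it hurts generalization, you can soften it, e.g. by raising the inverse frequencies to a power between 0 and 1.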
Hi Dylan,
Thank you so much.
The following worked for me:

import numpy as np
import torch

num_classes = 7
train_data = get_train_data()  # each row holds (sample, label)

# Count samples per class; the label is in column 1 of each row.
class_sample_count = np.zeros(num_classes)
for row in train_data:
    class_sample_count[int(row[1])] += 1

# Inverse class frequency: rare classes get larger weights.
class_weights = 1.0 / (class_sample_count / len(train_data))

# Assign each sample the weight of its class.
weights = torch.Tensor([class_weights[int(row[1])] for row in train_data])

sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, len(weights))
trainloader = torch.utils.data.DataLoader(train_data_set, batch_size=batch_size, sampler=sampler)