I have a dataset whose data points each fall into one of three labels/classes: A, B, C.
class A has 3000 data points
class B has 2000 data points
class C has 1000 data points.
In every epoch of my training I want my Dataset to pick all 1000 data points of C and randomly select 1000 data points of B and 1000 data points of A.
As it stands in the code below, the DataLoader will parse through all 6000 data points. How do I modify the Dataset below to ‘balance’ the dataset so that I use 1000 points from each class at every epoch?
Edit: For clarity, the Dataset should sample randomly from A and B at every epoch. So the admitted batches from A and B will look different every epoch.
I think there is a problem in your Imbalanced_Dataset class.
In your __len__ method you are returning len(self.dataset). However, self.dataset is undefined; I assume you mean self.data_array. This leads to another problem, though: a DataLoader will pick len(self.data_array) elements of your dataset per epoch. As far as I understand your question, you want it to pick only number_of_classes*minimal_number_of_samples elements per epoch.
Your __len__ method should therefore look like this:
    def __len__(self):
        # Count the samples in each class, then report (number of classes) * (size of the smallest class).
        number_of_elements_per_class = [len(np.where(self.labels==class_idx)[0]) for class_idx in np.unique(self.labels)]
        return len(number_of_elements_per_class)*min(number_of_elements_per_class)
Hi @blueeagle, I’ve corrected some of the typos in the variable names.
This isn’t quite what I want. I want to ensure that for every epoch, I have 1000 randomly selected elements per class.
I don’t think changing the __len__ function will help: if I change the length to 3000 in my case, the DataLoader will only pick indices from 0 to 2999 and completely miss class B and class C.
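To make that concern concrete, here is a tiny sketch. It assumes the 6000 points are stored sorted by class (A first, then B, then C), which is not stated in the thread but is what the claim above implies. The `labels` list and `capped_len` are stand-ins invented for the illustration.

```python
# Assumed layout: all of A, then all of B, then all of C.
labels = ["A"] * 3000 + ["B"] * 2000 + ["C"] * 1000

# If __len__ returns 3000, only indices 0..2999 can ever be requested,
# whether or not the order is shuffled.
capped_len = 3000
seen_classes = {labels[i] for i in range(capped_len)}
print(seen_classes)  # {'A'}
```

So shortening the length on its own is not enough; the indices also have to be redirected to the right samples.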
You can generate a random map from the indices 0-2999 to the indices in A/B/C to achieve this, and regenerate it every epoch so that the points drawn from A and B change:
[0-999] → a list of random indices from A [0-2999]
[1000-1999] → a list of random indices from B [0-1999]
[2000-2999] → indices from C [0-999]
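The random map described above could be sketched roughly as follows. This is a minimal illustration rather than a drop-in replacement for the original class: `BalancedView`, `resample`, and `index_map` are hypothetical names, the example uses only the standard library, and the toy `data` list stands in for the real samples. In PyTorch the class would presumably subclass `torch.utils.data.Dataset`.

```python
import random
from collections import defaultdict


class BalancedView:
    """Hypothetical wrapper exposing n_classes * min_class_size items."""

    def __init__(self, data, labels, seed=None):
        self.data = data
        self.rng = random.Random(seed)
        # Group the indices of the full dataset by class label.
        self.by_class = defaultdict(list)
        for idx, label in enumerate(labels):
            self.by_class[label].append(idx)
        self.per_class = min(len(v) for v in self.by_class.values())
        self.resample()

    def resample(self):
        """Draw a fresh random map; meant to be called once per epoch."""
        self.index_map = []
        for label in sorted(self.by_class):
            # Sample without replacement; the smallest class is used in full.
            self.index_map.extend(
                self.rng.sample(self.by_class[label], self.per_class)
            )

    def __len__(self):
        return len(self.index_map)

    def __getitem__(self, i):
        # Translate position i in [0, len) to an index into the full data.
        return self.data[self.index_map[i]]


# Toy stand-in for the real dataset: (sample_id, label) pairs.
labels = ["A"] * 3000 + ["B"] * 2000 + ["C"] * 1000
data = list(zip(range(6000), labels))
view = BalancedView(data, labels, seed=0)
```

With this sketch, the training loop would call `view.resample()` at the start of each epoch before iterating over the DataLoader. If the DataLoader uses worker processes, it is worth checking that the workers actually see the new map.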