Balance an imbalanced Dataset

datalord · March 21, 2022, 10:15am

I have a dataset with data that falls into one of three labels/classes: A, B, C.

class A has 3000 data points
class B has 2000 data points
class C has 1000 data points.

In every epoch of my training I want my Dataset to pick all 1000 data points of C and randomly select 1000 data points of B and 1000 data points of A.

As it stands in the code below, the DataLoader will iterate over all 6000 data points. How do I modify the Dataset below to ‘balance’ the dataset so that I use 1000 points from each class in every epoch?

import numpy as np
from torch.utils.data import Dataset

# Toy data: three classes of different sizes
A = np.random.rand(3000)
B = np.random.rand(2000)
C = np.random.rand(1000)

# Concatenated in the order A, B, C; labels are 0, 1, 2
data_array = np.concatenate([A,B,C])
labels = np.concatenate([np.zeros_like(A), np.ones_like(B), 2*np.ones_like(C)])

class Imbalanced_Dataset(Dataset):
    def __init__(self, data_array, labels):
        self.data_array = data_array
        self.labels = labels

    def __len__(self):
        return len(self.data_array)

    def __getitem__(self, index):
        data_point = self.data_array[index]
        label = self.labels[index]
        return data_point, label

Edit: For clarity, the Dataset should sample randomly from A and B at every epoch, so the points drawn from A and B will differ from epoch to epoch.

blueeagle · March 21, 2022, 10:28am

I think there is a problem in your Imbalanced_Dataset class.
In your __len__ method you are returning len(self.dataset). However, self.dataset is undefined; I assume you mean self.data_array. This leads to another problem, though: a DataLoader will pick len(self.data_array) elements of your dataset per epoch. As far as I understand your question, you want it to pick only number_of_classes*minimal_number_of_samples elements per epoch.
Your __len__ method should therefore look like this:

def __len__(self):
    number_of_elements_per_class = [len(np.where(self.labels==class_idx)[0]) for class_idx in np.unique(self.labels)]
    return len(number_of_elements_per_class)*min(number_of_elements_per_class)

datalord · March 21, 2022, 10:41am

Hi @blueeagle, I’ve corrected some of the typos in the variable names.

This isn’t quite what I want. I want to ensure that for every epoch, I have 1000 randomly selected elements per class.

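I don’t think changing the __len__ function will help: if I change the length to 3000 in my case, the DataLoader will only pick indices from 0 to 2999. Since the arrays are concatenated in the order A, B, C, those indices all belong to class A, so class B and class C are missed completely.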
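
This concern can be checked directly on the label array. A minimal sketch, assuming the same A, B, C concatenation order as in the original post (the names reachable and classes_seen are only for illustration):

```python
import numpy as np

# Same layout as in the original post: 3000 x A, then 2000 x B, then 1000 x C.
labels = np.concatenate([np.zeros(3000), np.ones(2000), 2 * np.ones(1000)])

# If __len__ returns 3000, only indices 0..2999 are ever requested.
reachable = labels[:3000]
classes_seen = np.unique(reachable)
print(classes_seen)  # [0.]
```

Only label 0 (class A) shows up, so shortening __len__ alone does not balance anything.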

ejguan · March 22, 2022, 3:38pm

You can generate a random map from the indices 0-2999 to indices within A/B/C to achieve this:
[0-999] → a list of random indices from A [0-2999]
[1000-1999] → a list of random indices from B [0-1999]
[2000-2999] → indices from C [0-999]
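
The mapping above could be sketched roughly as follows. This is not code from the thread: the class name BalancedDataset, the resample() method and the seed argument are hypothetical, and it assumes the labels array identifies the class of every element of data_array.

```python
import numpy as np
from torch.utils.data import Dataset


class BalancedDataset(Dataset):
    """Sketch: expose the same number of samples per class, re-drawn on resample()."""

    def __init__(self, data_array, labels, seed=None):
        self.data_array = data_array
        self.labels = labels
        self.rng = np.random.default_rng(seed)
        # Positions of each class inside the concatenated arrays.
        self.class_indices = [np.flatnonzero(labels == c) for c in np.unique(labels)]
        # Size of the smallest class (1000 for the data in the question).
        self.n_per_class = min(len(idx) for idx in self.class_indices)
        self.resample()

    def resample(self):
        # Draw n_per_class positions without replacement from every class,
        # so the smallest class is always used in full.
        self.index_map = np.concatenate([
            self.rng.choice(idx, size=self.n_per_class, replace=False)
            for idx in self.class_indices
        ])

    def __len__(self):
        return len(self.index_map)

    def __getitem__(self, index):
        # Translate the DataLoader's index into a position in the full arrays.
        real_index = self.index_map[index]
        return self.data_array[real_index], self.labels[real_index]


data_array = np.concatenate([np.random.rand(3000), np.random.rand(2000), np.random.rand(1000)])
labels = np.concatenate([np.zeros(3000), np.ones(2000), 2 * np.ones(1000)])
dataset = BalancedDataset(data_array, labels, seed=0)
```

Calling dataset.resample() before iterating the DataLoader in each epoch gives a fresh random subset of A and B. One caveat: DataLoader worker processes hold their own copy of the dataset, so with persistent_workers=True the workers would probably keep using the old map.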

datalord · March 23, 2022, 9:52am

This is a very creative solution. Thanks!