How to balance mini-batches during each epoch

Hello all,

I have a very unbalanced dataset, which I tried to balance using the following code (first the Dataset class, then my training setup):

import os

import numpy as np
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset


class myDataset(Dataset):
    def __init__(self, csv_file, root_dir, target, length, transform=None):
        # Keep only the first `length` rows of the annotation file.
        self.annotations = pd.read_csv(csv_file).iloc[:length, :]
        self.root_dir = root_dir
        self.transform = transform
        self.target = target
        self.length = length

    def __len__(self):
        return len(self.annotations)

    def __alltargets__(self):
        # Labels of every sample, used later for the stratified split.
        return self.annotations.loc[:, self.target]

    def __getitem__(self, index):
        img_path = os.path.join(self.root_dir, self.annotations.loc[index, 'image_id'])
        image = np.array(Image.open(img_path))

        # Albumentations transforms are called with keyword arguments and return a dict.
        if self.transform:
            image = self.transform(image=image)["image"]

        # HWC -> CHW, the layout PyTorch models expect.
        image = np.transpose(image, (2, 0, 1)).astype(np.float32)
        image = torch.tensor(image)

        y_label = torch.tensor(int(self.annotations.loc[index, str(self.target)]))

        return image, y_label

And then my code:

import albumentations as al
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader

aug = al.Compose([
    al.RandomResizedCrop(H, W, p=0.2),
    al.Resize(H, W),
    al.Transpose(p=0.2),
    al.HorizontalFlip(p=0.5),
    al.VerticalFlip(p=0.2),
    al.Normalize(max_pixel_value=255.0, always_apply=True, p=1.0),
])

dataset = myDataset(csv_file=LABEL_PATH,
                    root_dir=IMAGE_PATH,
                    target='gender',
                    length=LENGTH,
                    transform=aug)

n_samples = len(dataset)
y = dataset.__alltargets__()

# Stratified split: both index sets keep the class proportions of y.
train_idx, valid_idx = train_test_split(np.arange(n_samples), test_size=0.2, shuffle=True, stratify=y)

train_sampler = torch.utils.data.SubsetRandomSampler(train_idx)
test_sampler = torch.utils.data.SubsetRandomSampler(valid_idx)

# shuffle must stay False when a sampler is passed.
train_loader = DataLoader(dataset=dataset, batch_size=batch_size, shuffle=False, pin_memory=True, num_workers=4, sampler=train_sampler)
test_loader = DataLoader(dataset=dataset, batch_size=batch_size, shuffle=False, pin_memory=True, num_workers=4, sampler=test_sampler)

My question: my current code only distributes the classes evenly between the train and test splits. How can I also make the mini-batches balanced?

Thank you very much,
Habib

You could use a WeightedRandomSampler as described in this post with an example.
If you want to split the dataset in a stratified way, you could use e.g. sklearn.model_selection.train_test_split with the stratify option on the indices of the dataset (passing the targets to stratify) and then wrap the resulting indices in Subsets.
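
As a rough illustration of the stratified split with Subsets, here is a minimal sketch on toy data. The TensorDataset, the 90/10 class ratio, the batch size, and the names toy_dataset, train_subset, and valid_subset are assumptions made for the example and are not part of the original code.

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, Subset, TensorDataset

# Toy stand-in for the real dataset (assumption): 90 samples of class 0, 10 of class 1.
targets = np.array([0] * 90 + [1] * 10)
features = torch.randn(len(targets), 3)
toy_dataset = TensorDataset(features, torch.from_numpy(targets))

# Split the indices (not the data) so both parts keep the 90/10 class ratio.
train_idx, valid_idx = train_test_split(
    np.arange(len(targets)),
    test_size=0.2,
    shuffle=True,
    stratify=targets,
    random_state=0,
)

# Each Subset behaves like a standalone dataset restricted to its indices.
train_subset = Subset(toy_dataset, train_idx.tolist())
valid_subset = Subset(toy_dataset, valid_idx.tolist())

train_loader = DataLoader(train_subset, batch_size=16, shuffle=True)
valid_loader = DataLoader(valid_subset, batch_size=16, shuffle=False)
```

Note that stratifying only preserves the class ratio in each split. The batches drawn from train_loader are still as imbalanced as the data itself.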

Thank you @ptrblck for your response.

In my code above, I’ve already used the stratify option of train_test_split, so my train and test splits both preserve the class proportions.

After that, I used SubsetRandomSampler based on these stratified indices.

I’m struggling with how to add a WeightedRandomSampler on top of the SubsetRandomSampler. Thank you for your help.
As you can see, all my data initially sits in a single dataset object, which I have to split in order to monitor overfitting. My issue is that I don’t know how to apply a WeightedRandomSampler to the training data only.

Thank you very much,
Habib

You could create the weights using the code snippet from my previous post and replace the SubsetRandomSampler with a WeightedRandomSampler that uses the train_idx.
Since you are already selecting a subset via train_idx, note that you only need to calculate the weights for these indices and pass them to the WeightedRandomSampler.
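
A minimal sketch of this suggestion on toy data follows. Several things in it are assumptions for illustration: the TensorDataset, the made-up train_idx (every second index, where the thread would use the output of train_test_split), and inverse class frequency as the weighting scheme, which is one common choice rather than necessarily the one from the linked post. Because WeightedRandomSampler draws positions in range(len(weights)), the sketch wraps the dataset in a Subset so that those positions line up with the training samples.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset, WeightedRandomSampler

# Toy stand-in for the real dataset (assumption): 900 samples of class 0, 100 of class 1.
all_targets = np.array([0] * 900 + [1] * 100)
toy_dataset = TensorDataset(torch.randn(len(all_targets), 3), torch.from_numpy(all_targets))

# Hypothetical training indices; in the thread these come from train_test_split.
train_idx = np.arange(0, 1000, 2)

# Compute weights from the training targets only.
train_targets = all_targets[train_idx]
class_counts = np.bincount(train_targets)       # samples per class in the training split
class_weights = 1.0 / class_counts              # rarer class gets a larger weight
sample_weights = class_weights[train_targets]   # one weight per training sample

sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(sample_weights),
    replacement=True,
)

# The sampler yields positions 0..len(train_idx)-1, so index into a Subset,
# not into the full dataset.
train_subset = Subset(toy_dataset, train_idx.tolist())
train_loader = DataLoader(train_subset, batch_size=50, sampler=sampler)
```

With replacement=True the minority samples are repeated within an epoch, so the batches are balanced in expectation rather than exactly. The validation loader can keep using the unweighted valid_idx.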