Class imbalance train test split

I’d like to manually split an image dataset with 2 classes that make up 94% and 6% of the data. So far I’ve just been using a SubsetRandomSampler, but I’d like the minority class to be equally represented in the train/valid/test splits, and I’m at a loss for how to do this.

How can I make a list of majority + minority images and then pass it into ImageFolder?

If you have stored the targets in your Dataset or can somehow precompute them, you could use scikit-learn's train_test_split with the stratify argument to get the training and test indices. Using these indices, you can create a training and a test Dataset with torch.utils.data.Subset. Here is a small dummy example:

import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, Subset


class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.randn(1000, 3, 24, 24)
        self.target = torch.cat((
            torch.zeros(940, dtype=torch.long),
            torch.ones(60, dtype=torch.long)
        ))
        
    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        return x, y
        
    def __len__(self):
        return len(self.data)

dataset = MyDataset()
targets = dataset.target.numpy()
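# Stratified split: both subsets keep the original class ratio
# (train_test_split holds out 25% for the test set by default)
train_indices, test_indices = train_test_split(np.arange(targets.shape[0]), stratify=targets)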

# Check class balance
_, train_counts = np.unique(targets[train_indices], return_counts=True)
_, test_counts = np.unique(targets[test_indices], return_counts=True)
print('Train balance {}\nTest balance {}'.format(
    train_counts[1]/train_counts[0], test_counts[1]/test_counts[0]))
# Output:
# Train balance 0.06382978723404255
# Test balance 0.06382978723404255

train_dataset = Subset(dataset, indices=train_indices)
test_dataset = Subset(dataset, indices=test_indices)
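
The question also asks for a validation split. One way to get it is to call train_test_split twice, stratifying each time. The following is only a minimal sketch under some assumptions: the 60/20/20 proportions, the fixed random_state, and the helper minority_fraction are illustrative choices, not part of the original answer. The labels are a synthetic stand-in; for a torchvision ImageFolder they should be available through its targets attribute.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the 94% / 6% ratio from the question.
# Assumption: for an ImageFolder, use np.array(dataset.targets) instead.
targets = np.array([0] * 940 + [1] * 60)
indices = np.arange(len(targets))

# Step 1: hold out 20% of all samples as the test set, stratified by label.
trainval_idx, test_idx = train_test_split(
    indices, test_size=0.2, stratify=targets, random_state=0)

# Step 2: split the remainder into train and validation.
# 0.25 of the remaining 80% equals 20% of the full dataset.
train_idx, valid_idx = train_test_split(
    trainval_idx, test_size=0.25, stratify=targets[trainval_idx],
    random_state=0)


def minority_fraction(idx):
    """Hypothetical helper: share of minority-class (label 1) samples."""
    return float(np.mean(targets[idx] == 1))


for name, idx in [("train", train_idx), ("valid", valid_idx), ("test", test_idx)]:
    print(name, len(idx), minority_fraction(idx))
```

The three index arrays can then be passed to Subset in the same way as train_indices and test_indices above.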