Imbalanced dataset

Hi, just wondering if anyone has PyTorch code for dividing the CIFAR-10 dataset into head and tail classes (for example, a class with fewer than 3000 samples is considered a tail class, otherwise a head class), feeding these classes to a CNN for feature extraction, and then computing the angular variance of the head and tail classes. Any help or suggestions would be highly appreciated.

Would you like to resample the dataset manually to create the imbalance?
The CIFAR10 dataset should be balanced initially as seen here:

import torch
from torchvision import datasets

dataset = datasets.CIFAR10(root='PATH')
targets = torch.tensor(dataset.targets)
print(targets.unique(return_counts=True))
> (tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), tensor([5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000]))

If you want to manually resample it, you could have a look at my old tutorial.

I want to have it automated, meaning I only want to provide an imbalance factor and have the dataset created according to it.

I still think you could reuse some code of this tutorial, e.g.:

import numpy as np
from torch.utils.data import Dataset
from torchvision import datasets, transforms


class ImbalancedCIFAR10(Dataset):
    def __init__(self, imbal_class_prop, root, train, download, transform):
        self.dataset = datasets.CIFAR10(
            root=root, train=train, download=download, transform=transform)
        self.train = train
        self.imbal_class_prop = imbal_class_prop
        self.nb_classes = len(self.dataset.classes)
        self.idxs = self.resample()

    def get_labels_and_class_counts(self):
        return self.labels, self.imbal_class_counts

    def resample(self):
        '''
        Resample the indices to create an artificially imbalanced dataset.
        '''
        # Recent torchvision versions expose the labels of both splits as
        # `targets` (instead of the older train_labels/test_labels)
        targets = np.array(self.dataset.targets)
        class_counts = np.bincount(targets, minlength=self.nb_classes)
        # Get class indices for resampling
        class_indices = [
            np.where(targets == i)[0] for i in range(self.nb_classes)]
        # Reduce class count by proportion
        self.imbal_class_counts = [
            int(count * prop)
            for count, prop in zip(class_counts, self.imbal_class_prop)
        ]
        # Get class indices for reduced class count
        idxs = []
        for c in range(self.nb_classes):
            imbal_class_count = self.imbal_class_counts[c]
            idxs.append(class_indices[c][:imbal_class_count])
        idxs = np.hstack(idxs)
        self.labels = targets[idxs]
        return idxs

    def __getitem__(self, index):
        img, target = self.dataset[self.idxs[index]]
        return img, target

    def __len__(self):
        return len(self.idxs)


# Create class proportions
imbal_class_prop = np.hstack(([0.1] * 5, [1.0] * 5))
transform = transforms.ToTensor()
train_dataset_imbalanced = ImbalancedCIFAR10(
    imbal_class_prop, root='.', train=True, download=True, transform=transform)

Note that you might need to update some parts as the code is quite old by now.

Heaps of thanks for that, Ptrblck. I believe imbal_class_prop = np.hstack(([0.1] * 5, [1.0] * 5)) divides the classes into only two groups (say 50%/50% or 40%/60%), but within each group every class still has the same number of samples. What I want is a long-tailed distribution, i.e. the tail classes should each have a low and different number of samples.

You can provide the percentage to be used for each class.
In my example, the first 5 classes keep only 10% of their samples, while the latter 5 classes keep all of them.
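To connect this back to the original question, here is a minimal sketch of how the head/tail split could be derived from the per-class counts. The helper name `split_head_tail` and the stand-in label array are hypothetical and only for illustration; the threshold of 3000 is the one mentioned in the first post.

```python
import numpy as np


def split_head_tail(labels, threshold=3000, nb_classes=10):
    """Split class ids into head and tail groups by sample count.

    Classes with fewer than `threshold` samples are treated as tail
    classes, all others as head classes.
    """
    counts = np.bincount(np.asarray(labels), minlength=nb_classes)
    head = np.where(counts >= threshold)[0]
    tail = np.where(counts < threshold)[0]
    return head, tail, counts


# Stand-in labels mimicking the proportions above:
# 500 samples for classes 0-4 and 5000 samples for classes 5-9.
labels = np.repeat(np.arange(10), [500] * 5 + [5000] * 5)
head, tail, counts = split_head_tail(labels)
print("head classes:", head)
print("tail classes:", tail)
```

With the dataset class above, the labels returned by `get_labels_and_class_counts()` could be passed in instead of the stand-in array.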

I got it; however, the first five classes will still contain an equal number of samples. What I want is for each of them to have a different number of samples.

Wouldn’t it work if you pass different values for each class then?
E.g.

imbal_class_prop = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
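If you would rather pass a single imbalance factor, the proportions could also be generated for you. The sketch below assumes an exponentially decaying profile, which is a common convention for long-tailed CIFAR variants but is not something defined in this thread; `long_tail_proportions` is a hypothetical helper name, and the imbalance factor is taken here to mean the ratio between the largest and the smallest class.

```python
import numpy as np


def long_tail_proportions(imbalance_factor, nb_classes=10):
    """Return one proportion per class, decaying exponentially.

    Class 0 keeps all of its samples and the last class keeps
    1 / imbalance_factor of them (assumed convention).
    """
    exponents = np.arange(nb_classes) / (nb_classes - 1)
    return (1.0 / imbalance_factor) ** exponents


imbal_class_prop = long_tail_proportions(imbalance_factor=100)
print(imbal_class_prop)
```

The resulting array has a different value for every class, so it can be passed as imbal_class_prop in place of the hand-written one.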