How to handle imbalanced classes

Hi!
I used a WeightedRandomSampler to deal with an imbalanced dataset and assigned the weights as follows:

weights = 1.0 / torch.tensor(counts, dtype=torch.float)

where counts is a numpy array that stores the number of samples for each class.
But when I run my dataloader, it still yields mostly majority-class samples. Why is that?

Please let me know in case you need any further information.
Thanks!

The weights tensor should contain the weight for each sample, not just the inverse class counts, as shown in this example.
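For example, a minimal sketch of the difference (the class counts are made up):

import torch
from torch.utils.data import WeightedRandomSampler

counts = torch.tensor([100., 10., 5.])   # number of samples per class
class_weights = 1.0 / counts             # shape [num_classes] - not what the sampler expects

labels = torch.tensor([0, 0, 1, 2, 0])   # one label per sample
sample_weights = class_weights[labels]   # shape [num_samples] - one weight per sample
sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights), replacement=True)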


I did exactly the same thing as shown in that example. Here’s my code snippet:

_, counts = np.unique(label_list, return_counts=True)
weights = 1.0 / torch.tensor(counts, dtype=torch.float)
sample_weights = weights[label_list]
sampler = WeightedRandomSampler(sample_weights, len(sample_weights), replacement=True)

where label_list contains the label for each sample.
Thereafter, I feed this sampler into my dataloader.

Thanks for the clarification. I clearly misunderstood how you are passing the weights.
In that case it should work.
Could you post a (small) reproducible code snippet so that we could have a look?

Sure!
There are two functions, sampler_ and loader, where the former is called by the latter:

def sampler_(labels):
    _, counts = np.unique(labels, return_counts=True)
    weights = 1.0 / torch.tensor(counts, dtype=torch.float)
    sample_weights = weights[labels]
    sampler = WeightedRandomSampler(sample_weights, len(sample_weights), replacement=True)
    return sampler

def loader(data_dir, transform, train_split=0.75):
    images, labels, _ = parse_data(data_dir)
    dataset = ImageDataset(images, labels, transform)
    dataset_size = len(dataset)
    indices = list(range(dataset_size))
    np.random.shuffle(indices) # shuffle the dataset before splitting into train and val
    split = int(np.floor(train_split * dataset_size))
    train_indices, val_indices = indices[:split], indices[split:]
    train_labels = [labels[x] for x in train_indices]
    val_labels = [labels[x] for x in val_indices]
    train_sampler, val_sampler = sampler_(train_labels), sampler_(val_labels)
    trainloader = DataLoader(dataset, sampler=train_sampler)
    valloader = DataLoader(dataset, sampler=val_sampler)
    return trainloader, valloader
for (feats, labels) in trainloader:
    print(labels)
Output:
tensor([5, 5, 5, 5, 6, 5, 5, 6, 8, 5, 6, 5, 5, 5, 6, 5, 5, 6, 5, 5, 5, 5, 5, 5,
        6, 5, 6, 5, 0, 5, 5, 6])
tensor([5, 5, 5, 5, 5, 6, 5, 5, 5, 1, 5, 5, 5, 5, 5, 5, 5, 6, 5, 5, 5, 5, 5, 5,
        5, 5, 5, 5, 6, 5, 5, 5])
tensor([5, 6, 5, 5, 5, 5, 5, 6, 5, 5, 6, 5, 6, 5, 5, 5, 5, 5, 6, 5, 5, 6, 5, 5,
        5, 6, 5, 5, 0, 5, 5, 5])
and so on (where 5 is the majority class).

Please let me know your take on this.
Thanks!

The correspondence between the dataset splits and sample_weights is broken.
While train_labels and val_labels correspond to the shuffled indices, both samplers will just assign the weights to dataset indices starting at 0 in sequential order.

The easiest way to fix it would be to wrap dataset in Subsets before passing them to the DataLoaders:

from torch.utils.data import Subset

trainloader = DataLoader(Subset(dataset, train_indices), sampler=train_sampler, batch_size=10)
valloader = DataLoader(Subset(dataset, val_indices), sampler=val_sampler, batch_size=10)
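As a quick sanity check, here is a minimal, self-contained sketch (with made-up data) showing that the Subset-wrapped loader yields roughly balanced batches:

import numpy as np
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset, WeightedRandomSampler

# made-up imbalanced dataset: 90 samples of class 0, 10 samples of class 1
labels = np.array([0] * 90 + [1] * 10)
dataset = TensorDataset(torch.randn(100, 4), torch.from_numpy(labels))

indices = np.random.permutation(100)
train_indices = indices[:75]
train_labels = labels[train_indices]

_, counts = np.unique(train_labels, return_counts=True)
weights = 1.0 / torch.tensor(counts, dtype=torch.float)
sample_weights = weights[train_labels]
train_sampler = WeightedRandomSampler(sample_weights, len(sample_weights), replacement=True)

# the sampler now indexes into the Subset, so the weights line up with the samples again
trainloader = DataLoader(Subset(dataset, train_indices), sampler=train_sampler, batch_size=10)
for _, batch_labels in trainloader:
    print(batch_labels)  # both classes should appear roughly equally often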

That worked! Thank you so much! :smiley:


Hi @ptrblck!

I have another question:
Is it a good idea to use a WeightedRandomSampler and also use the weight argument in CrossEntropyLoss to tackle the imbalanced-dataset issue? Let me know your take on this.

Thanks! :smiley:

I’m not sure if you should combine both, so I would recommend observing the confusion matrix (or other metrics) and balancing the training as much as needed.
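If you want to try weighting the loss alone, a minimal sketch (inverse-frequency weights are just one common heuristic):

import torch
import torch.nn as nn

counts = torch.tensor([100., 10., 5.])                 # made-up class counts
class_weights = counts.sum() / (len(counts) * counts)  # normalized inverse frequency
criterion = nn.CrossEntropyLoss(weight=class_weights)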


Understood. Thanks again! :smiley:

Can you provide sample code to wrap the dataset in a Subset?

You would just need to wrap your current Dataset in a Subset and provide the indices you would like to sample from. Something like this should work:

import torch
from torch.utils.data import Subset

dataset = MyDataset()
indices = torch.arange(10)
subset = Subset(dataset, indices)
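To make this runnable end to end, MyDataset can be replaced by any Dataset, e.g. a TensorDataset (purely for illustration):

import torch
from torch.utils.data import Subset, TensorDataset

dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))
subset = Subset(dataset, torch.arange(10))
print(len(subset))  # 10 - only the selected indices are visible
print(subset[0])    # returns dataset[0], i.e. a (features, label) tuple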

Let me know if you need more information.

My dataset contains 3 classes. The training distribution is (array([0, 1, 2]), array([10874, 1890, 6331])) and the validation distribution is (array([0, 1, 2]), array([4579, 838, 2766])), which I get from train_labels and test_labels (here val and test are the same). But if I try to print the labels from the testloader, it contains only labels of class 0, which is the largest class.

The code is provided below:

def sampler_(labels):
    _, counts = np.unique(labels, return_counts=True)
    weights = 1.0 / torch.tensor(counts, dtype=torch.float)
    print(counts)
    sample_weights = weights[labels]
    sampler = torch.utils.data.sampler.WeightedRandomSampler(sample_weights, len(sample_weights), replacement=True)
    return sampler


import random
data_dir = '../input/dog-age/dataset/dataset'

def load_split_train_test(datadir, valid_size = .3):
    train_transforms = transforms.Compose([transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
                                       ])

    test_transforms = transforms.Compose([transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
                                      ])

    train_data = datasets.ImageFolder(datadir, transform=train_transforms)
    test_data = datasets.ImageFolder(datadir, transform=test_transforms)
  
    num_train = len(train_data)
    
    labels = [sample[1] for sample in train_data.imgs]
    indices = list(range(num_train))
    random.shuffle(indices)
    
    split = int(np.floor(valid_size * num_train))
    train_idx, val_idx = indices[split:], indices[:split]
    train_labels = [labels[x] for x in train_idx]
    val_labels = [labels[x] for x in val_idx]
    
    print(np.unique(np.asarray(val_labels),return_counts=True))
    print(np.unique(np.asarray(train_labels),return_counts=True))
    train_sampler, val_sampler = sampler_(train_labels), sampler_(val_labels)
    trainloader = torch.utils.data.DataLoader(train_data,sampler=train_sampler, batch_size=64)
    testloader = torch.utils.data.DataLoader(test_data,sampler=val_sampler, batch_size=64)
    return trainloader, testloader

trainloader, testloader = load_split_train_test(data_dir, .3)
print(trainloader.dataset.classes)

I am using:
train_set, val_set = torch.utils.data.random_split(trainset, [train_len, test_len])
and I have 30 imbalanced classes, so how can I use a WeightedRandomSampler for a large train_set and val_set? The data type of my train_set and val_set is torch.utils.data.dataset.Subset.

This post shows an example of splitting the indices and using a WeightedRandomSampler. Let me know if this works for you.
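Roughly, applied to random_split it would look like this (a minimal sketch with made-up data; replace the TensorDataset with your own dataset):

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

def make_sampler(subset_labels):
    subset_labels = np.asarray(subset_labels)
    _, counts = np.unique(subset_labels, return_counts=True)
    weights = 1.0 / torch.tensor(counts, dtype=torch.float)
    sample_weights = weights[subset_labels]
    return WeightedRandomSampler(sample_weights, len(sample_weights), replacement=True)

# made-up imbalanced dataset with 3 classes
labels = np.array([0] * 80 + [1] * 15 + [2] * 5)
dataset = TensorDataset(torch.randn(100, 4), torch.from_numpy(labels))

train_set, val_set = torch.utils.data.random_split(dataset, [75, 25])
# random_split returns Subsets, so .indices maps back into the full dataset
train_labels = [labels[i] for i in train_set.indices]
val_labels = [labels[i] for i in val_set.indices]

train_loader = DataLoader(train_set, sampler=make_sampler(train_labels), batch_size=10)
val_loader = DataLoader(val_set, sampler=make_sampler(val_labels), batch_size=10)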

My class labels are strings, so I am using the WeightedRandomSampler like this:

def sampler_(labels):
    classes, counts = np.unique(labels, return_counts=True)
    weights = 1.0 / torch.tensor(counts, dtype=torch.float)
    # look up each string label's index in the sorted unique classes
    sample_weights = weights[[list(classes).index(x) for x in labels]]
    print(f'sample={sample_weights}')
    sampler = torch.utils.data.sampler.WeightedRandomSampler(sample_weights, len(sample_weights), replacement=True)
    return sampler

labels_list = ['a', 'b', 'c', 'a', 'f', 'c'...]
# labels_list length = length of dataset
train_set, val_set = torch.utils.data.random_split(dataset, [train_len, test_len])
train_labels = [labels_list[x] for x in train_set.indices]
val_labels = [labels_list[x] for x in val_set.indices]
train_sampler, val_sampler = sampler_(train_labels), sampler_(val_labels)

I hope I am doing this right; please let me know if I am doing something wrong.

What’s your use case for using strings as the label data? I guess you would have to transform them into class indices (or another numerical representation) at some point during training anyway, or how are you using them to calculate e.g. the loss?

I am concatenating multiple datasets from multiple folders with random classes, and before concatenating them I am appending each dataset._csv to a main dataframe.

labels_list = all_dataset['class']
# all_dataset is a dataframe covering the complete dataset

I convert them into indices when building sample_weights, and I get the loss by passing train_set and val_set along with the weighted sampler during model fitting.
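For reference, the string-to-index conversion can also be done in one call with return_inverse (a minimal sketch):

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

labels_list = np.array(['a', 'b', 'c', 'a', 'f', 'c'])
# return_inverse maps every string label to the index of its class
classes, label_indices = np.unique(labels_list, return_inverse=True)
counts = np.bincount(label_indices)
weights = 1.0 / torch.tensor(counts, dtype=torch.float)
sample_weights = weights[label_indices]
sampler = WeightedRandomSampler(sample_weights, len(sample_weights), replacement=True)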

This code creates only a train_loader. How would we create a test_loader for the test dataset? Is there any sample code to refer to? Per my understanding, that would require specifying which samples have already been used in the training dataloader.

I don’t know which code snippet you are referring to, but this post shows how to use the indices for the training and validation splits to create separate DataLoaders.