How to handle imbalanced classes

Hi!
I used a WeightedRandomSampler to deal with an imbalanced dataset and assigned the weights as follows:

weights = 1.0 / torch.tensor(counts, dtype=torch.float)

where counts is a numpy array that stores the number of samples for each class.
But when I run my dataloader, it still yields mostly majority-class samples. Why is that?

Please let me know in case you need any further information.
Thanks!

The weights tensor should contain the weight for each sample, not just the inverse class counts, as shown in this example.
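For example, a minimal sketch of the difference (the class counts are made up):

import torch
from torch.utils.data import WeightedRandomSampler

counts = torch.tensor([100., 10., 5.])   # number of samples per class
class_weights = 1.0 / counts             # shape [num_classes] - not what the sampler expects

labels = torch.tensor([0, 0, 1, 2, 0])   # one label per sample
sample_weights = class_weights[labels]   # shape [num_samples] - one weight per sample
sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights), replacement=True)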


I did exactly the same thing as shown in that example. Here’s my code snippet:

_, counts = np.unique(label_list, return_counts=True)
weights = 1.0 / torch.tensor(counts, dtype=torch.float)
sample_weights = weights[label_list]
sampler = WeightedRandomSampler(sample_weights, len(sample_weights), replacement=True)

where label_list contains the label for each sample.
Thereafter, I feed this sampler into my dataloader.

Thanks for the clarification. I clearly misunderstood how you are passing the weights.
In that case it should work.
Could you post a (small) reproducible code snippet so that we could have a look?

Sure!
There are two functions, sampler_ and loader, where the former is called by the latter:

def sampler_(labels):
    _, counts = np.unique(labels, return_counts=True)
    weights = 1.0 / torch.tensor(counts, dtype=torch.float)
    sample_weights = weights[labels]
    sampler = WeightedRandomSampler(sample_weights, len(sample_weights), replacement=True)
    return sampler

def loader(data_dir, transform, train_split=0.75):
    images, labels, _ = parse_data(data_dir)
    dataset = ImageDataset(images, labels, transform)
    dataset_size = len(dataset)
    indices = list(range(dataset_size))
    np.random.shuffle(indices) # shuffle the dataset before splitting into train and val
    split = int(np.floor(train_split * dataset_size))
    train_indices, val_indices = indices[:split], indices[split:]
    train_labels = [labels[x] for x in train_indices]
    val_labels = [labels[x] for x in val_indices]
    train_sampler, val_sampler = sampler_(train_labels), sampler_(val_labels)
    trainloader = DataLoader(dataset, sampler=train_sampler)
    valloader = DataLoader(dataset, sampler=val_sampler)
    return trainloader, valloader
for (feats, labels) in trainloader:
    print(labels)
Output:
tensor([5, 5, 5, 5, 6, 5, 5, 6, 8, 5, 6, 5, 5, 5, 6, 5, 5, 6, 5, 5, 5, 5, 5, 5,
        6, 5, 6, 5, 0, 5, 5, 6])
tensor([5, 5, 5, 5, 5, 6, 5, 5, 5, 1, 5, 5, 5, 5, 5, 5, 5, 6, 5, 5, 5, 5, 5, 5,
        5, 5, 5, 5, 6, 5, 5, 5])
tensor([5, 6, 5, 5, 5, 5, 5, 6, 5, 5, 6, 5, 6, 5, 5, 5, 5, 5, 6, 5, 5, 6, 5, 5,
        5, 6, 5, 5, 0, 5, 5, 5])
and so on (where 5 is the majority class).

Please let me know your take on this.
Thanks!

The correspondence between the dataset splits and sample_weights is broken.
While train_labels and val_labels correspond to the shuffled indices, both samplers will just assign the weights to dataset indices starting at 0 in sequential order.

The easiest way to fix it would be to wrap dataset in Subsets before passing them to the DataLoaders:

from torch.utils.data import Subset

trainloader = DataLoader(Subset(dataset, train_indices), sampler=train_sampler, batch_size=10)
valloader = DataLoader(Subset(dataset, val_indices), sampler=val_sampler, batch_size=10)
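As a quick sanity check, here is a minimal, self-contained sketch (with made-up data) showing that the Subset-wrapped loader yields roughly balanced batches:

import numpy as np
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset, WeightedRandomSampler

# made-up imbalanced dataset: 90 samples of class 0, 10 samples of class 1
labels = np.array([0] * 90 + [1] * 10)
dataset = TensorDataset(torch.randn(100, 4), torch.from_numpy(labels))

indices = np.random.permutation(100)
train_indices = indices[:75]
train_labels = labels[train_indices]

_, counts = np.unique(train_labels, return_counts=True)
weights = 1.0 / torch.tensor(counts, dtype=torch.float)
sample_weights = weights[train_labels]
train_sampler = WeightedRandomSampler(sample_weights, len(sample_weights), replacement=True)

# the sampler now indexes into the Subset, so the weights line up with the samples again
trainloader = DataLoader(Subset(dataset, train_indices), sampler=train_sampler, batch_size=10)
for _, batch_labels in trainloader:
    print(batch_labels)  # both classes should appear roughly equally often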

That worked! Thank you so much! :smiley:


Hi @ptrblck!

I have another question:
Is it a good idea to use a WeightedRandomSampler and also use the weight argument in CrossEntropyLoss to tackle the imbalanced-dataset issue? Let me know your take on this.

Thanks! :smiley:

I’m not sure if you should combine both, so I would recommend observing the confusion matrix (or other metrics) and balancing the training as much as needed.
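If you want to try weighting the loss alone, a minimal sketch (inverse-frequency weights are just one common heuristic):

import torch
import torch.nn as nn

counts = torch.tensor([100., 10., 5.])                 # made-up class counts
class_weights = counts.sum() / (len(counts) * counts)  # normalized inverse frequency
criterion = nn.CrossEntropyLoss(weight=class_weights)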


Understood. Thanks again! :smiley:

Can you provide sample code to wrap the dataset in a Subset?

You would just need to wrap your current Dataset in a Subset and provide the indices you would like to sample from. Something like this should work:

import torch
from torch.utils.data import Subset

dataset = MyDataset()
indices = torch.arange(10)
subset = Subset(dataset, indices)
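To make this runnable end to end, MyDataset can be replaced by any Dataset, e.g. a TensorDataset (purely for illustration):

import torch
from torch.utils.data import Subset, TensorDataset

dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))
subset = Subset(dataset, torch.arange(10))
print(len(subset))  # 10 - only the selected indices are visible
print(subset[0])    # returns dataset[0], i.e. a (features, label) tuple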

Let me know if you need more information.

My dataset contains 3 classes. The training distribution is (array([0, 1, 2]), array([10874, 1890, 6331])) and the validation distribution is (array([0, 1, 2]), array([4579, 838, 2766])), which I get from train_labels and test_labels (here val and test are the same). But if I try to print the labels from the testloader, it contains only labels of class 0, which is the largest class.

The code is provided below:

def sampler_(labels):
    _, counts = np.unique(labels, return_counts=True)
    weights = 1.0 / torch.tensor(counts, dtype=torch.float)
    print(counts)
    sample_weights = weights[labels]
    sampler = torch.utils.data.sampler.WeightedRandomSampler(sample_weights, len(sample_weights), replacement=True)
    return sampler


import random
data_dir = '../input/dog-age/dataset/dataset'

def load_split_train_test(datadir, valid_size = .3):
    train_transforms = transforms.Compose([transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
                                       ])

    test_transforms = transforms.Compose([transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
                                      ])

    train_data = datasets.ImageFolder(datadir, transform=train_transforms)
    test_data = datasets.ImageFolder(datadir, transform=test_transforms)
  
    num_train = len(train_data)
    
    labels = [sample[1] for sample in train_data.imgs]
    indices = list(range(num_train))
    random.shuffle(indices)
    
    split = int(np.floor(valid_size * num_train))
    train_idx, val_idx = indices[split:], indices[:split]
    train_labels = [labels[x] for x in train_idx]
    val_labels = [labels[x] for x in val_idx]
    
    print(np.unique(np.asarray(val_labels),return_counts=True))
    print(np.unique(np.asarray(train_labels),return_counts=True))
    train_sampler, val_sampler = sampler_(train_labels), sampler_(val_labels)
    trainloader = torch.utils.data.DataLoader(train_data,sampler=train_sampler, batch_size=64)
    testloader = torch.utils.data.DataLoader(test_data,sampler=val_sampler, batch_size=64)
    return trainloader, testloader

trainloader, testloader = load_split_train_test(data_dir, .3)
print(trainloader.dataset.classes)

I am using:
train_set, val_set = torch.utils.data.random_split(trainset, [train_len, test_len])
and I have 30 imbalanced classes, so how can I use a WeightedRandomSampler for a large train_set and val_set? The data type of my train_set and val_set is torch.utils.data.dataset.Subset.

This post shows an example of splitting the indices and using a WeightedRandomSampler. Let me know if this works for you.
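Roughly, applied to random_split it would look like this (a minimal sketch with made-up data; replace the TensorDataset with your own dataset):

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

def make_sampler(subset_labels):
    subset_labels = np.asarray(subset_labels)
    _, counts = np.unique(subset_labels, return_counts=True)
    weights = 1.0 / torch.tensor(counts, dtype=torch.float)
    sample_weights = weights[subset_labels]
    return WeightedRandomSampler(sample_weights, len(sample_weights), replacement=True)

# made-up imbalanced dataset with 3 classes
labels = np.array([0] * 80 + [1] * 15 + [2] * 5)
dataset = TensorDataset(torch.randn(100, 4), torch.from_numpy(labels))

train_set, val_set = torch.utils.data.random_split(dataset, [75, 25])
# random_split returns Subsets, so .indices maps back into the full dataset
train_labels = [labels[i] for i in train_set.indices]
val_labels = [labels[i] for i in val_set.indices]

train_loader = DataLoader(train_set, sampler=make_sampler(train_labels), batch_size=10)
val_loader = DataLoader(val_set, sampler=make_sampler(val_labels), batch_size=10)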

My class labels are strings, so I am using the WeightedRandomSampler like this:

def sampler_(labels):
    classes, counts = np.unique(labels, return_counts=True)
    weights = 1.0 / torch.tensor(counts, dtype=torch.float)
    # look up each string label's index in the sorted unique classes
    sample_weights = weights[[list(classes).index(x) for x in labels]]
    print(f'sample={sample_weights}')
    sampler = torch.utils.data.sampler.WeightedRandomSampler(sample_weights, len(sample_weights), replacement=True)
    return sampler

labels_list = ['a', 'b', 'c', 'a', 'f', 'c'...]
# labels_list length = length of dataset
train_set, val_set = torch.utils.data.random_split(dataset, [train_len, test_len])
train_labels = [labels_list[x] for x in train_set.indices]
val_labels = [labels_list[x] for x in val_set.indices]
train_sampler, val_sampler = sampler_(train_labels), sampler_(val_labels)

I hope I am doing this right; please let me know if I am doing something wrong.

What’s your use case for using strings as the label data? I guess you would have to transform them into class indices (or another numerical representation) at some point during training anyway, or how are you using them to calculate e.g. the loss?

I am concatenating multiple datasets from multiple folders with random classes, and before concatenating them I am appending each dataset._csv to a main dataframe.

labels_list = all_dataset['class']
# all_dataset is a dataframe covering the complete dataset

I convert them into indices when building sample_weights, and I get the loss by passing train_set and val_set along with the weighted sampler during model fitting.
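For reference, the string-to-index conversion can also be done in one call with return_inverse (a minimal sketch):

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

labels_list = np.array(['a', 'b', 'c', 'a', 'f', 'c'])
# return_inverse maps every string label to the index of its class
classes, label_indices = np.unique(labels_list, return_inverse=True)
counts = np.bincount(label_indices)
weights = 1.0 / torch.tensor(counts, dtype=torch.float)
sample_weights = weights[label_indices]
sampler = WeightedRandomSampler(sample_weights, len(sample_weights), replacement=True)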

This code creates only a train_loader. How would we create a test_loader for the test dataset? Is there any sample code to refer to? Per my understanding, that would require specifying which samples have already been used in the training dataloader.

I don’t know which code snippet you are referring to, but this post shows how to use the indices for the training and validation splits to create separate DataLoaders.