How to handle imbalanced classes

Understood. Thanks again! :smiley:

Can you provide the sample code to wrap the dataset in a Subset?

You would just need to wrap your current Dataset in a Subset and provide the indices you would like to sample from. Something like this should work:

import torch
from torch.utils.data import Subset

dataset = MyDataset()
indices = torch.arange(10)  # the indices you would like to sample from
subset = Subset(dataset, indices)
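
The Subset then behaves like any other Dataset, so you could e.g. pass it to a DataLoader (just a sketch, since MyDataset is a placeholder):

from torch.utils.data import DataLoader

# the Subset behaves like any other Dataset, so it can be passed to a DataLoader
loader = DataLoader(subset, batch_size=2, shuffle=True)
for batch in loader:
    print(batch)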

Let me know, if you need more information.

My dataset contains 3 classes. The distributions I get from train_labels and test_labels are (array([0, 1, 2]), array([10874, 1890, 6331])) for training and (array([0, 1, 2]), array([4579, 838, 2766])) for validation (here val and test are the same). But if I try to print the labels coming out of the testloader, it contains only labels of class 0, which is the largest class.

The code is provided below:

import numpy as np
import torch

def sampler_(labels):
    # weight each sample by the inverse frequency of its class
    _, counts = np.unique(labels, return_counts=True)
    weights = 1.0 / torch.tensor(counts, dtype=torch.float)
    print(counts)
    sample_weights = weights[labels]
    sampler = torch.utils.data.sampler.WeightedRandomSampler(sample_weights, len(sample_weights), replacement=True)
    return sampler


import random
from torchvision import datasets, transforms

data_dir = '../input/dog-age/dataset/dataset'

def load_split_train_test(datadir, valid_size = .3):
    train_transforms = transforms.Compose([transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
                                       ])

    test_transforms = transforms.Compose([transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
                                      ])

    train_data = datasets.ImageFolder(datadir,       
                    transform=train_transforms)
    test_data = datasets.ImageFolder(datadir,
                    transform=test_transforms)
  
    num_train = len(train_data)
    
    labels = [sample[1] for sample in train_data.imgs]
    indices = list(range(num_train))
    random.shuffle(indices)
    
    split = int(np.floor(valid_size * num_train))
    train_idx, val_idx = indices[split:], indices[:split]
    train_labels = [labels[x] for x in train_idx]
    val_labels = [labels[x] for x in val_idx]
    
    print(np.unique(np.asarray(val_labels),return_counts=True))
    print(np.unique(np.asarray(train_labels),return_counts=True))
    train_sampler, val_sampler = sampler_(train_labels), sampler_(val_labels)
    trainloader = torch.utils.data.DataLoader(train_data,sampler=train_sampler, batch_size=64)
    testloader = torch.utils.data.DataLoader(test_data,sampler=val_sampler, batch_size=64)
    return trainloader, testloader

trainloader, testloader = load_split_train_test(data_dir, .3)
print(trainloader.dataset.classes)

I am using:
train_set, val_set = torch.utils.data.random_split(trainset, [train_len, test_len])
and I have 30 imbalanced classes, so how can I use WeightedRandomSampler for the large train_set and val_set? The data type of my train_set and val_set is torch.utils.data.dataset.Subset.

This post shows you an example of splitting the indices and using a WeightedRandomSampler. Let me know, if this works for you.
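
In short, the pattern looks roughly like this (the dummy data, class count, and split sizes below are made up); the important part is that each sampler's weights are computed from the labels of its own split, and the sampler is passed together with that same Subset to its DataLoader:

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# dummy data standing in for your real dataset (sizes and class count are made up)
data = torch.randn(1000, 10)
labels = torch.randint(0, 30, (1000,))
dataset = TensorDataset(data, labels)

train_set, val_set = torch.utils.data.random_split(dataset, [800, 200])

def make_sampler(split_labels):
    # inverse class-frequency weight for every sample of this split
    counts = torch.bincount(split_labels, minlength=30).clamp(min=1)
    weights = 1.0 / counts.float()
    sample_weights = weights[split_labels]
    return WeightedRandomSampler(sample_weights, len(sample_weights), replacement=True)

# look up the labels of each split via the Subset indices
train_labels = labels[train_set.indices]
val_labels = labels[val_set.indices]

train_loader = DataLoader(train_set, batch_size=64, sampler=make_sampler(train_labels))
val_loader = DataLoader(val_set, batch_size=64, sampler=make_sampler(val_labels))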

My class labels are strings, so I am using WeightedRandomSampler like this:

import numpy as np
import torch

def sampler_(labels):
    classes, counts = np.unique(labels, return_counts=True)
    weights = 1.0 / torch.tensor(counts, dtype=torch.float)
    # map each string label to its integer class index before indexing the weights
    sample_weights = weights[[list(classes).index(x) for x in labels]]
    print(f'sample={sample_weights}')
    sampler = torch.utils.data.sampler.WeightedRandomSampler(sample_weights, len(sample_weights), replacement=True)
    return sampler

labels_list = ['a','b','c','a','f','c'...]  # labels_list length = length of dataset
train_set, val_set = torch.utils.data.random_split(dataset, [train_len, test_len])
train_labels = [labels_list[x] for x in train_set.indices]
val_labels = [labels_list[x] for x in val_set.indices]
train_sampler, val_sampler = sampler_(train_labels), sampler_(val_labels)

I hope I am doing this right; please let me know if I am doing something wrong.

What’s your use case for using strings as the label data? I guess you would have to transform them into class indices (or another numerical representation) at some point during training anyway, or how are you using them to calculate e.g. the loss?
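
In any case, one convenient way to do that mapping (just a sketch with made-up string labels) is np.unique with return_inverse=True, which directly gives you an integer class index for every sample:

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# made-up string labels, e.g. one entry per sample taken from a dataframe column
labels_list = ['a', 'b', 'c', 'a', 'f', 'c']

# return_inverse=True gives the integer class index of every sample directly
classes, class_indices = np.unique(labels_list, return_inverse=True)
counts = np.bincount(class_indices)

weights = 1.0 / torch.tensor(counts, dtype=torch.float)
sample_weights = weights[torch.from_numpy(class_indices)]

sampler = WeightedRandomSampler(sample_weights, len(sample_weights), replacement=True)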

I am concatenating multiple datasets from multiple folders with random classes, and before concatenating them I am appending dataset._csv to a main dataframe.

labels_list=all_dataset['class']
# all_dataset is a dataframe of complete dataset length

I am converting it into indices when building sample_weights, and I am computing the loss by passing train_set and val_set along with the weighted sampler during model fitting.

This code creates only a train_loader. How would we create a test_loader for the test dataset? Is there any sample code to refer to? As I understand it, that would require specifying which samples have already been used in the training dataloader.

I don’t know which code snippet you are referring to, but this post shows how to use the indices for the training and validation splits to create separate DataLoaders.
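
The basic idea (the dataset and split ratio below are just placeholders) is to split the indices once, so the two loaders are built from disjoint Subsets and no sample can appear in both:

import torch
from torch.utils.data import DataLoader, Subset

dataset = MyDataset()  # placeholder for your actual dataset

# shuffle the indices once and split them; the two Subsets are disjoint by construction
indices = torch.randperm(len(dataset)).tolist()
split = int(0.8 * len(dataset))
train_loader = DataLoader(Subset(dataset, indices[:split]), batch_size=64, shuffle=True)
test_loader = DataLoader(Subset(dataset, indices[split:]), batch_size=64, shuffle=False)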

Hi ptrblck, thanks for your kind explanation of WeightedRandomSampler. I am trying to use it to resample an unbalanced dataset, but the dataloader still mostly outputs the most frequent class without any improvement. My dataset loads a dict-like batch (e.g., data['image']) and it does not include label information. Here's my snippet for the weights:

  weight_list = [1./len(np.where(submeta['commands']==i)[0]) for i in range(3)]
  # weight_list = [1, 1, 1]
  # submeta['commands'] is a sequence of labels of each index
  submeta['weights'] = torch.from_numpy(np.array([weight_list[i] for i in submeta['commands']]))

How do you measure the imbalance if the DataLoader doesn’t output any label information?

I have a separate variable that stores the labels. There are three classes 0, 1, 2 stored in submeta['commands']=[0,1,0,...,2]. So I want to know whether it should be included in the dataloader for the sampler to work, or whether it is OK as long as the weights calculated from the labels share the same index with the data? Hope I got myself across, and thanks for your kind reply!

You don’t need to return these labels, since you already created sample weights for each index.
If weight_list is [1, 1, 1] before you are assigning it to each sample index, the weight calculation is wrong or your dataset is perfectly balanced.
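
As a quick sanity check (the counts below are made up), the per-class weights should come out clearly different for an imbalanced commands array:

import numpy as np

# made-up imbalanced label sequence standing in for submeta['commands']
commands = np.array([0] * 700 + [1] * 200 + [2] * 100)
weight_list = [1. / len(np.where(commands == i)[0]) for i in range(3)]
print(weight_list)  # ~[0.0014, 0.005, 0.01] -- equal values would only appear for a balanced dataset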

Really thank you ptrblck! A late update: there was a problem with my weight indexing (relative vs. absolute index). I followed your advice from another similar topic, checked the indices carefully, and bang! Thanks :smile:

As a follow-up question, I wonder how to implement class-balanced sampling when using BCELoss for minibatch updates. The first challenge is the shuffling of classes vs. a fixed weight, but reimplementing the DataLoader can fix that. I am more uncertain about the second challenge: if I do not set drop_last=True, is there a way to keep the class weights correct for the last batch, which might have fewer samples? (The same problem applies to CrossEntropyLoss.)

BCEWithLogitsLoss seems to be better because it allows for class weights rather than sample weights.
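
For example (the class counts below are made up), BCEWithLogitsLoss takes a pos_weight per output and CrossEntropyLoss takes a per-class weight, so the weighting does not depend on how many samples of each class happen to land in the last, smaller batch:

import torch
import torch.nn as nn

# made-up class counts for a binary problem: 900 negatives, 100 positives
num_neg, num_pos = 900, 100

# weight the positive class by the negative/positive ratio
criterion_bce = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([num_neg / num_pos]))

# the multi-class analogue: one weight per class, e.g. inverse class frequency
class_counts = torch.tensor([900., 100., 500.])
criterion_ce = nn.CrossEntropyLoss(weight=1.0 / class_counts)

logits = torch.randn(8, 1)
targets = torch.randint(0, 2, (8, 1)).float()
loss = criterion_bce(logits, targets)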

I guess you’re wrong!
Defining:

sampler = WeightedRandomSampler(samples_weight, len(samples_weight))
train_dataset = torch.utils.data.TensorDataset(data, target)
train_loader = DataLoader(
    train_dataset, batch_size=bs, num_workers=1, sampler=sampler)

without replacement=True, acts the same as:

train_loader = DataLoader(
    train_dataset, batch_size=bs, num_workers=1, shuffle=True)

and I cannot figure out how you actually handled the unbalanced dataset?!

No, a weighted sampler without replacement will not act as a random sampler which just shuffles.
It will still use the weights to draw the samples and will thus select samples with a higher weight earlier. However, it will not be able to “oversample” since it cannot re-draw the same sample if replacement=False is used.

I have posted several code snippets in this thread or have linked to other threads which give you minimal, executable code. E.g. you can run this one and will see that each batch is balanced when replacement=True. I don’t know where I claimed otherwise.
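
For reference, a minimal sketch along the lines of that snippet (the dummy data and sizes are arbitrary):

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

numDataPoints = 1000
data_dim = 5
bs = 100

# dummy data with a 9:1 class imbalance
data = torch.randn(numDataPoints, data_dim)
target = torch.cat((torch.zeros(int(numDataPoints * 0.9), dtype=torch.long),
                    torch.ones(int(numDataPoints * 0.1), dtype=torch.long)))
print('target train 0/1: {}/{}'.format((target == 0).sum(), (target == 1).sum()))

# weight every sample by the inverse frequency of its class
class_sample_count = torch.bincount(target)
weight = 1. / class_sample_count.float()
samples_weight = weight[target]

sampler = WeightedRandomSampler(samples_weight, len(samples_weight), replacement=True)

train_dataset = TensorDataset(data, target)
train_loader = DataLoader(train_dataset, batch_size=bs, sampler=sampler)

for i, (x, y) in enumerate(train_loader):
    print('batch index {}, 0/1: {}/{}'.format(i, (y == 0).sum(), (y == 1).sum()))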

I also see that you’ve actually responded to the post sharing the code snippet, so did you actually run it?

(I’m @dieas93 with a different account.)
Here is the output of your code with replacement=False:

target train 0/1: 900/100
batch index 0, 0/1: 56/44
batch index 1, 0/1: 68/32
batch index 2, 0/1: 83/17
batch index 3, 0/1: 95/5
batch index 4, 0/1: 98/2
batch index 5, 0/1: 100/0
batch index 6, 0/1: 100/0
batch index 7, 0/1: 100/0
batch index 8, 0/1: 100/0
batch index 9, 0/1: 100/0

and with replacement=True:

target train 0/1: 900/100
batch index 0, 0/1: 47/53
batch index 1, 0/1: 47/53
batch index 2, 0/1: 53/47
batch index 3, 0/1: 51/49
batch index 4, 0/1: 44/56
batch index 5, 0/1: 59/41
batch index 6, 0/1: 50/50
batch index 7, 0/1: 49/51
batch index 8, 0/1: 45/55
batch index 9, 0/1: 47/53

With no replacement, as you said, the network sees the samples with higher weight earlier, but does it really affect accuracy? The model sees all samples anyway, which means the samples with lower weight (the majority class) will have more impact on accuracy (as we can see from the outputs).
Even with replacement there is no guarantee that all samples come into play! From a mathematical point of view, more epochs are needed to make sure the model sees all samples at least once.
I think that to be 100% sure the data is balanced in each batch, one should define a custom BatchSampler, and of course there are no good docs on how to define one!
Correct me if I’m wrong.
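
For what it's worth, here is a rough sketch of what I have in mind for such a balanced BatchSampler (only a sketch; it oversamples the minority class with replacement inside every batch), so correct me on this part as well:

import numpy as np
import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset

class BalancedBatchSampler(Sampler):
    """Yields batches containing the same number of samples from every class."""

    def __init__(self, labels, batch_size):
        labels = np.asarray(labels)
        self.classes = np.unique(labels)
        assert batch_size % len(self.classes) == 0, 'batch_size must be divisible by the number of classes'
        self.per_class = batch_size // len(self.classes)
        # indices belonging to every class, used to draw per-class chunks
        self.class_indices = [np.where(labels == c)[0] for c in self.classes]
        largest = max(len(idx) for idx in self.class_indices)
        self.num_batches = int(np.ceil(largest / self.per_class))

    def __iter__(self):
        for _ in range(self.num_batches):
            batch = []
            for class_idx in self.class_indices:
                # minority classes are oversampled with replacement
                batch.extend(int(i) for i in np.random.choice(class_idx, self.per_class, replace=True))
            np.random.shuffle(batch)
            yield batch

    def __len__(self):
        return self.num_batches

# quick check with the same dummy 900/100 data
data = torch.randn(1000, 5)
target = torch.cat((torch.zeros(900, dtype=torch.long), torch.ones(100, dtype=torch.long)))
loader = DataLoader(TensorDataset(data, target),
                    batch_sampler=BalancedBatchSampler(target.numpy(), batch_size=100))
for i, (x, y) in enumerate(loader):
    print('batch index {}, 0/1: {}/{}'.format(i, (y == 0).sum(), (y == 1).sum()))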