Balanced Sampling between classes with torchvision DataLoader

Hi, I have a very unbalanced dataset like yours; were you able to improve results with this weighted sampler?

Shouldn’t it be weight_per_class[i] = 1/float(count[i]) instead of weight_per_class[i] = N/float(count[i]) ?

Let’s say I have a minibatch size of 16. How will the DataLoader use the WeightedRandomSampler? When called with iter(), the sampler returns an iterator over len(weights) indices, but I only need 16 per batch, so how does this work out?
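
For concreteness, here is roughly what I mean (the counts are made up, this is just a sketch):

import torch
from torch.utils.data import WeightedRandomSampler

count = [900, 100]                                    # made-up per-class sample counts
N = float(sum(count))

weight_per_class_n = [N / float(c) for c in count]    # N / count[i]
weight_per_class_1 = [1.0 / float(c) for c in count]  # 1 / count[i], same up to a constant factor

# one weight per sample (all class-0 samples first, then class-1 samples)
targets = [0] * count[0] + [1] * count[1]
weights = [weight_per_class_n[t] for t in targets]

sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
indices = list(sampler)                               # len(weights) indices per epoch
# a DataLoader with batch_size=16 simply consumes 16 of these indices per batch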

This might be relevant:

2 Likes

Hey @InnovArul, did you try to implement ImbalancedDatasetSampler? I could not find anything about it in the torch documentation.

@ptrblck

If I keep shuffle=True in the train loader, I get the error below:

raise ValueError('sampler option is mutually exclusive with '
ValueError: sampler option is mutually exclusive with shuffle

If you provide a sampler to the DataLoader, you cannot specify shuffle anymore, as the sampler is now responsible for creating the indices and thus for the shuffling.
There are a few samplers which enable shuffling, e.g. SubsetRandomSampler and WeightedRandomSampler.
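
A minimal sketch of passing one of these samplers instead of shuffle=True (the dataset here is just random stand-in data):

import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

# e.g. a SubsetRandomSampler over the first 800 samples; the sampler already shuffles,
# so shuffle=True must not be passed in addition
sampler = SubsetRandomSampler(range(800))
loader = DataLoader(dataset, batch_size=16, sampler=sampler)
# loader = DataLoader(dataset, batch_size=16, sampler=sampler, shuffle=True)  # raises the error above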

4 Likes

@ptrblck I want to clarify one point regarding the WeightedRandomSampler.
While it is oversampling the minority class, it is also undersampling the majority class.
Let’s say I have 100 images of class A and 900 images of class B.
Then the dataloader length will be 1000, and when we iterate in minibatches it will ensure an equal distribution, so approximately 500 images of class A and 500 images of class B will be used for training.
Can’t we say it is oversampling the minority but undersampling the majority of the dataset?

1 Like

You could assume this, if you use the described setup.

However, you could e.g. specify replacement=False, which will return num_samples unique indices.
The over-/undersampling also depends on the specified weights, i.e. the WeightedRandomSampler does not automatically produce equal class distributions in each batch; you are free to specify the weights you need.
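
As a rough illustration (the class counts are made up), the weights determine the expected per-batch distribution:

import torch
from torch.utils.data import WeightedRandomSampler

targets = torch.cat([torch.zeros(900), torch.ones(100)]).long()

# inverse class frequency -> each class is drawn with roughly equal probability
class_weights = 1.0 / torch.bincount(targets).float()
sample_weights = class_weights[targets]
balanced = WeightedRandomSampler(sample_weights, num_samples=len(targets), replacement=True)

# any other ratio works as well, e.g. draw class 1 roughly 3x as often as class 0
custom_weights = sample_weights.clone()
custom_weights[targets == 1] *= 3.0
skewed = WeightedRandomSampler(custom_weights, num_samples=len(targets), replacement=True)

# replacement=False instead returns num_samples unique indices
unique = WeightedRandomSampler(sample_weights, num_samples=len(targets), replacement=False)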

Hello, I can’t specify shuffle=True. When I get the first batch of data, all the targets are 0, so the train loader can’t be used.
I use a WeightedRandomSampler.

WeightedRandomSampler will sample the elements based on the passed weights.
Note that you should provide a weight value for each sample in your Dataset.
If the batches only contain zero targets, you might need to adjust the weights for these class samples.
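
A small sketch of what that looks like, assuming targets holds the class index of every sample in your Dataset (random stand-in data here):

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

targets = torch.cat([torch.zeros(990), torch.ones(10)]).long()
dataset = TensorDataset(torch.randn(1000, 10), targets)

class_weights = 1.0 / torch.bincount(targets).float()    # the rare class gets the larger weight
sample_weights = class_weights[targets]                  # one weight per sample, len == len(dataset)

sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights))
loader = DataLoader(dataset, batch_size=16, sampler=sampler)

data, target = next(iter(loader))
print(target)   # should now contain both classes instead of only zeros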

1 Like

@ptrblck @smth I have a total of 910 training images, 10 from class A and 900 from class B. When I count the number of images per class in each epoch, it is not consistent: the count for class A is around 600 and the count for class B around 900, which is way more than the actual 10 and 900. I am not using any WeightedRandomSampler, just a plain ImageFolder and DataLoader.

Are you using the default sampler?
If you are using an ImageFolder and just wrapping it in a DataLoader (without any specific arguments), each sample of your dataset should be returned exactly once per epoch.

Could you post your code to create the ImageFolder and DataLoader?

@ptrblck Here is the code that I used:

import os
import torch
from torchvision import datasets

# data_dir, data_transforms and TRAIN are defined earlier in my script
image_datasets = {
    x: datasets.ImageFolder(
        os.path.join(data_dir, x),
        transform=data_transforms[x]
    )
    for x in [TRAIN]
}

dataset = image_datasets[TRAIN]
batch_size = 32

dataloaders = {
    x: torch.utils.data.DataLoader(
        image_datasets[x], batch_size=batch_size,
        num_workers=4, shuffle=True
    )
    for x in [TRAIN]
}

@ptrblck the output when I ran the code:

Loaded 910 images under train
Classes: 
['trainA', 'trainB']
Loaded 600 images under val
Loaded 600 images under test
Classes under test: 
['testA', 'testB']
Epoch 0/15
----------
Validation batch 0/75.
Epoch 0 result: 
Epoch 0, count [632 900]
Avg loss (train): 0.0818
Avg acc (train): 0.9385
Avg loss (val): 0.3042
Avg acc (val): 0.5000
----------

Epoch 3/15
----------

Epoch 3 result: 
Epoch 3, count [618 900]
Avg loss (train): 0.0208
Avg acc (train): 0.9758
Avg loss (val): 0.3616
Avg acc (val): 0.5050

Thanks for the code!
Could you additionally post how you are using the DataLoader and how you are counting the number of samples?

@ptrblck

import numpy as np

# epoch_count is assumed to be initialized before the loop
for i, data in enumerate(dataloaders[TRAIN]):
    inputs, labels = data
    classes, batch_count = np.unique(labels.cpu().numpy(), return_counts=True)
    epoch_count += batch_count

2 Likes

I’m not sure what’s going on, as a dummy code snippet using FakeData works as expected.
Your code should work if each batch contains at least one sample of both classes.
Otherwise, if a batch only contains a single class, it should crash.

Could you print the outputs of batch_count and classes for a couple of steps?
I’m currently unsure how to debug this issue further.
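
Something along these lines is what I mean by the dummy FakeData check (the sizes are made up); using np.bincount with minlength also avoids the shape mismatch that np.unique causes when a batch happens to contain only one class:

import numpy as np
import torch
from torchvision import datasets, transforms

dataset = datasets.FakeData(size=910, image_size=(3, 64, 64), num_classes=2,
                            transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

epoch_count = np.zeros(2, dtype=np.int64)
for inputs, labels in loader:
    epoch_count += np.bincount(labels.numpy(), minlength=2)

print(epoch_count, epoch_count.sum())   # the per-class counts have to sum to 910 exactly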

Hello. How do I install the ImbalancedDatasetSampler module?

You can just copy the sampler.py file to your directory and use it.