Balanced Sampling between classes with torchvision DataLoader

Hi, I have a very unbalanced dataset like yours; were you able to improve results with this weighted sampler?

Shouldn’t it be weight_per_class[i] = 1/float(count[i]) instead of weight_per_class[i] = N/float(count[i]) ?

Let’s say I have a minibatch size of 16. How will the DataLoader use the WeightedRandomSampler? When called with iter(), the sampler returns an iterator over len(weights) indices, but I only need 16 per batch, so how does this work out?
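
For concreteness, here is roughly what I mean (the counts are made up, this is just a sketch):

import torch
from torch.utils.data import WeightedRandomSampler

count = [900, 100]                                    # made-up per-class sample counts
N = float(sum(count))

weight_per_class_n = [N / float(c) for c in count]    # N / count[i]
weight_per_class_1 = [1.0 / float(c) for c in count]  # 1 / count[i], same up to a constant factor

# one weight per sample (all class-0 samples first, then class-1 samples)
targets = [0] * count[0] + [1] * count[1]
weights = [weight_per_class_n[t] for t in targets]

sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
indices = list(sampler)                               # len(weights) indices per epoch
# a DataLoader with batch_size=16 simply consumes 16 of these indices per batch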

This might be relevant:

2 Likes

Hey @InnovArul, did you try to implement ImbalancedDatasetSampler? I could not find anything about it in the torch documentation.

@ptrblck

If I keep shuffle=True in the train loader, I get the error below:

raise ValueError('sampler option is mutually exclusive with '
ValueError: sampler option is mutually exclusive with shuffle

If you provide a sampler to the DataLoader, you cannot specify shuffle anymore, as the sampler is now responsible for creating the indices and thus for the shuffling.
There are a few samplers which enable shuffling, e.g. SubsetRandomSampler and WeightedRandomSampler.
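
A minimal sketch of passing one of these samplers instead of shuffle=True (the dataset here is just random stand-in data):

import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

# e.g. a SubsetRandomSampler over the first 800 samples; the sampler already shuffles,
# so shuffle=True must not be passed in addition
sampler = SubsetRandomSampler(range(800))
loader = DataLoader(dataset, batch_size=16, sampler=sampler)
# loader = DataLoader(dataset, batch_size=16, sampler=sampler, shuffle=True)  # raises the error above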

4 Likes

@ptrblck I want to clarify one point regarding the WeightedRandomSampler.
While it is oversampling the minority class, it is also undersampling the majority class.
Let’s say I have 100 images of class A and 900 images of class B.
Then the dataloader length will be 1000, and when we iterate in minibatches it will ensure an equal distribution, so approximately 500 images of class A and 500 images of class B will be used for training.
Can’t we say it is oversampling the minority but undersampling the majority of the dataset?

1 Like

You could assume this, if you use the described setup.

However, you could e.g. specify replacement=False, which will return num_samples unique indices.
The over-/undersampling also depends on the specified weights, i.e. the WeightedRandomSampler does not automatically produce equal class distributions in each batch; you are free to specify the weights you need.
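
As a rough illustration (the class counts are made up), the weights determine the expected per-batch distribution:

import torch
from torch.utils.data import WeightedRandomSampler

targets = torch.cat([torch.zeros(900), torch.ones(100)]).long()

# inverse class frequency -> each class is drawn with roughly equal probability
class_weights = 1.0 / torch.bincount(targets).float()
sample_weights = class_weights[targets]
balanced = WeightedRandomSampler(sample_weights, num_samples=len(targets), replacement=True)

# any other ratio works as well, e.g. draw class 1 roughly 3x as often as class 0
custom_weights = sample_weights.clone()
custom_weights[targets == 1] *= 3.0
skewed = WeightedRandomSampler(custom_weights, num_samples=len(targets), replacement=True)

# replacement=False instead returns num_samples unique indices
unique = WeightedRandomSampler(sample_weights, num_samples=len(targets), replacement=False)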

Hello, I can’t specify shuffle=True. When I get the first batch of data, all the targets are 0, so the train loader can’t be used.
I use a WeightedRandomSampler.

WeightedRandomSampler will sample the elements based on the passed weights.
Note that you should provide a weight value for each sample in your Dataset.
If the batches only contain zero targets, you might need to adjust the weights for these class samples.
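
A small sketch of what that looks like, assuming targets holds the class index of every sample in your Dataset (random stand-in data here):

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

targets = torch.cat([torch.zeros(990), torch.ones(10)]).long()
dataset = TensorDataset(torch.randn(1000, 10), targets)

class_weights = 1.0 / torch.bincount(targets).float()    # the rare class gets the larger weight
sample_weights = class_weights[targets]                  # one weight per sample, len == len(dataset)

sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights))
loader = DataLoader(dataset, batch_size=16, sampler=sampler)

data, target = next(iter(loader))
print(target)   # should now contain both classes instead of only zeros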

1 Like

@ptrblck @smth I have a total of 910 training images, 10 from class A and 900 from class B. When I count the number of images per class in each epoch, it is not consistent: the count for class A is around 600 and the count for class B around 900, which is way more than the actual 10 and 900. I am not using any WeightedRandomSampler, just a plain ImageFolder and DataLoader.

Are you using the default sampler?
If you are using an ImageFolder and just wrapping it in a DataLoader (without any specific arguments), each sample of your dataset should be returned exactly once per epoch.

Could you post your code to create the ImageFolder and DataLoader?

@ptrblck Here is the code that I used:

import os
import torch
from torchvision import datasets

# data_dir, data_transforms and TRAIN are defined earlier in my script
image_datasets = {
    x: datasets.ImageFolder(
        os.path.join(data_dir, x),
        transform=data_transforms[x]
    )
    for x in [TRAIN]
}

dataset = image_datasets[TRAIN]
batch_size = 32

dataloaders = {
    x: torch.utils.data.DataLoader(
        image_datasets[x], batch_size=batch_size,
        num_workers=4, shuffle=True
    )
    for x in [TRAIN]
}

@ptrblck the output when I ran the code:

Loaded 910 images under train
Classes: 
['trainA', 'trainB']
Loaded 600 images under val
Loaded 600 images under test
Classes under test: 
['testA', 'testB']
Epoch 0/15
----------
Validation batch 0/75.
Epoch 0 result: 
Epoch 0, count [632 900]
Avg loss (train): 0.0818
Avg acc (train): 0.9385
Avg loss (val): 0.3042
Avg acc (val): 0.5000
----------

Epoch 3/15
----------

Epoch 3 result: 
Epoch 3, count [618 900]
Avg loss (train): 0.0208
Avg acc (train): 0.9758
Avg loss (val): 0.3616
Avg acc (val): 0.5050

Thanks for the code!
Could you additionally post how you are using the DataLoader and how you are counting the number of samples?

@ptrblck

import numpy as np

# epoch_count is assumed to be initialized before the loop
for i, data in enumerate(dataloaders[TRAIN]):
    inputs, labels = data
    classes, batch_count = np.unique(labels.cpu().numpy(), return_counts=True)
    epoch_count += batch_count

2 Likes

I’m not sure what’s going on, as a dummy code snippet using FakeData works as expected.
Your code should work if each batch contains at least one sample of both classes.
Otherwise, if a batch only contains a single class, it should crash.

Could you print the outputs of batch_count and classes for a couple of steps?
I’m currently unsure how to debug this issue further.
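
Something along these lines is what I mean by the dummy FakeData check (the sizes are made up); using np.bincount with minlength also avoids the shape mismatch that np.unique causes when a batch happens to contain only one class:

import numpy as np
import torch
from torchvision import datasets, transforms

dataset = datasets.FakeData(size=910, image_size=(3, 64, 64), num_classes=2,
                            transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

epoch_count = np.zeros(2, dtype=np.int64)
for inputs, labels in loader:
    epoch_count += np.bincount(labels.numpy(), minlength=2)

print(epoch_count, epoch_count.sum())   # the per-class counts have to sum to 910 exactly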

Hello. How do I install the ImbalancedDatasetSampler module?

You can just copy the sampler.py file to your directory and use it.