Balanced Sampling between classes with torchvision DataLoader

I tried this and got the following error:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3296, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in
    from sampler.py import ImbalancedDatasetSampler
  File "C:\ProgramData\Anaconda3\lib\sampler.py", line 7
    ^
SyntaxError: invalid syntax

Sorry for the delay. I hope you have figured out the issue already.

It should be

from sampler import ImbalancedDatasetSampler

Thank you. I did that. I saved the script inside the torch.utils.data.sampler module and it threw a _C error. I'm not sure where the actual script file should be saved.

Jordan

This one works fine for me and is reasonably terse:

import torch
from torch.utils.data.sampler import WeightedRandomSampler
from torch.utils.data import DataLoader, TensorDataset

def class_imbalance_sampler(labels):
    # count how many samples fall into each class
    class_count = torch.bincount(labels.squeeze())
    # weight each class by its inverse frequency, then expand to per-sample weights
    class_weighting = 1. / class_count
    sample_weights = class_weighting[labels.squeeze()]
    sampler = WeightedRandomSampler(sample_weights, len(labels))
    return sampler

# samples: tensor (n x d) of type float; labels: tensor (n x 1) of type long
sampler = class_imbalance_sampler(labels)
train_ld = DataLoader(TensorDataset(samples, labels), sampler=sampler)

Is it possible to have shuffling on top of this?


Just to clarify something that seems a bit confusing in the discussion above: the num_samples argument to WeightedRandomSampler should be the size of your dataset (i.e. how many samples you want drawn per epoch), not the number of classes you have or the length of the class-weight array. This tripped me up; maybe it's helpful to someone else.
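To make that concrete, here is a minimal sketch (the variable names and sizes are made up for illustration):

import torch
from torch.utils.data import WeightedRandomSampler

# hypothetical setup: 1000 samples, 3 classes
targets = torch.randint(0, 3, (1000,))
class_counts = torch.bincount(targets)          # length = number of classes (3)
sample_weights = (1.0 / class_counts)[targets]  # length = number of samples (1000)

# num_samples is the number of samples drawn per epoch, typically len(dataset),
# not the number of classes and not len(class_counts)
sampler = WeightedRandomSampler(sample_weights, num_samples=len(targets), replacement=True)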


It helped. Thank you

Why do you need to use the weighted sampler for the test set? Isn't this supposed to be used only during training? What is the use of it at test/validation time? Also, the class_sample_counts are usually different for the train and test sets.
I don't quite get the intention here; any clarification is welcome.
Thanks a lot in advance.

weights = torch.DoubleTensor([.0009, .0001])
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights=weights, num_samples=len(dataset), replacement=True)


In my case, the sampled y_target is all 0's. Has anyone else faced this issue? I set a small batch size to track it.

The weights tensor should contain sample weights, i.e. a weight for each sample in the dataset, while it seems you might be passing class weights, i.e. a weight for each class.
Have a look at this example from this thread.


Thanks, looks like I was missing an equivalent of this part of the code:

samples_weight = np.array([weight[t] for t in y_train])
samples_weight = torch.from_numpy(samples_weight)
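For context, here is a self-contained sketch of that pattern (class weights expanded to per-sample weights; names like y_train and the label values are just placeholders):

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# hypothetical integer class labels for the training set
y_train = np.array([0, 0, 0, 0, 1, 1, 2])

class_sample_count = np.bincount(y_train)                 # samples per class
weight = 1.0 / class_sample_count                         # one weight per class
samples_weight = np.array([weight[t] for t in y_train])   # one weight per sample
samples_weight = torch.from_numpy(samples_weight)

sampler = WeightedRandomSampler(samples_weight, num_samples=len(samples_weight), replacement=True)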

Thanks for the code.
I have one doubt that I have been stuck on for a long time; it would be of much help if anyone could help me solve it.
My dataset has 2000 classes in a long-tail distribution, so I am using WeightedRandomSampler, and this code is of much help. I am using batch size = 32.
My doubt is whether I should use replacement=True or False. If I want a balanced distribution but keep replacement=False, then after iterating through the whole dataset my model will not have been trained on balanced data, since in that case there is no oversampling or duplication.
If I use replacement=True instead, would that solve the problem of getting a balanced dataset over the whole iteration of the data loader, and would the batches still contain instances from all classes, given that batch_size < number of classes in this case?

Thanks
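For anyone who wants to check this empirically, here is a minimal standalone sketch with made-up class counts (not the 2000-class dataset above) showing the effect of replacement:

import torch
from torch.utils.data import WeightedRandomSampler

# hypothetical long-tail toy labels: class 0 is very frequent, class 2 is rare
labels = torch.tensor([0] * 90 + [1] * 8 + [2] * 2)
sample_weights = (1.0 / torch.bincount(labels).float())[labels]

# with replacement=True the rare classes are oversampled, so the drawn epoch is
# roughly balanced; with replacement=False each index can be drawn at most once,
# so no oversampling is possible and the epoch stays imbalanced
for replacement in (True, False):
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=replacement)
    drawn = labels[list(sampler)]
    print(replacement, torch.bincount(drawn))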

Double post from here. Let’s stick to the newly created topic for further discussion.

@ptrblck
I saw two different implementations of the sampler throughout this post and summarized them below. Do they make any practical difference? Thanks!

(1) This sampler passes class weights and uses batch_size as num_samples:

weights = 1 / torch.Tensor(class_sample_count)
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, batch_size)

(2) This sampler passes per-sample weights and uses the size of train_dataset as num_samples:

sampler = torch.utils.data.sampler.WeightedRandomSampler(samples_weights, len(samples_weights))
train_loader = torch.utils.data.DataLoader(dataset_train, batch_size=args.batch_size, shuffle=True,
                                           sampler=sampler)

The second approach would be right as explained here.
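One caveat worth noting: DataLoader does not accept shuffle=True together with a sampler (the sampler already randomizes the order), so a working version of the second approach would look roughly like this (samples_weights, dataset_train, and args.batch_size are the placeholders from the post above):

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# samples_weights: one weight per training sample, as in approach (2)
sampler = WeightedRandomSampler(samples_weights, num_samples=len(samples_weights), replacement=True)

# note: shuffle must be omitted (or False) when a sampler is passed,
# otherwise DataLoader raises an error; the sampler already shuffles
train_loader = DataLoader(dataset_train, batch_size=args.batch_size, sampler=sampler)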

@ptrblck

Hi, I am working on text classification and my dataset contains imbalanced labels:
classcount = np.bincount(train_data['label']).tolist()
train_weights = 1. / torch.tensor(classcount, dtype=torch.float)
train__sampleweights = train_weights[train_data['label']]

# sampler built from the per-sample weights above
weighted_sampler = WeightedRandomSampler(
    weights=train__sampleweights,
    num_samples=len(train__sampleweights),
    replacement=True)

but the minority labels are still not predicted well on the validation set.
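For completeness, the sampler above would typically be passed to the training DataLoader along these lines (train_dataset here is a placeholder name):

from torch.utils.data import DataLoader

# the weighted sampler only rebalances what the training loader draws;
# the validation set keeps its natural (imbalanced) label distribution
train_loader = DataLoader(train_dataset,
                          batch_size=32,
                          sampler=weighted_sampler)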

Hi! I am new to PyTorch, so I am sorry if this question is too basic. I have an imbalanced dataset (the class priors are 0.77 and 0.23 for class 0 and class 1, respectively). Following some posts here, I am using the DataLoader with a sampler specified by

class_sample_count = np.zeros(len(classes), dtype=int)
for n, cl in enumerate(classes):
    class_sample_count[n] = sum(train_df.label == cl)

weights = 1 / class_sample_count
target = train_df.label.to_numpy()
samples_weight = weights[target]
samples_weight = torch.tensor(samples_weight, dtype=torch.float)
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, len(weights))

So, as @ptrblck mentions here, the batches are not balanced. I think I have two options here:

  1. compute the loss with class weights, or
  2. figure out how to ensure that using the sampler gives me balanced batches.

If anyone has already dealt with this, I would much appreciate any help, since
for 1) I am not sure whether the weights I should use for the loss are the original class priors or the class distribution produced by the sampler (i.e., the number of positive and negative samples within each batch), and for 2) I tried creating a samples_weight with equal values for all samples, but it didn't change the imbalance problem within the batches.
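For what it's worth, option 1 usually means weighting the loss by inverse class frequencies; here is a minimal sketch using the priors mentioned above (0.77 / 0.23), purely as an illustration:

import torch
import torch.nn as nn

# inverse-frequency class weights for priors 0.77 (class 0) and 0.23 (class 1)
class_priors = torch.tensor([0.77, 0.23])
class_weights = 1.0 / class_priors

criterion = nn.CrossEntropyLoss(weight=class_weights)

# hypothetical logits for a batch of 4 samples and their targets
logits = torch.randn(4, 2)
targets = torch.tensor([0, 1, 1, 0])
loss = criterion(logits, targets)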

Thank you!!

That's rather unexpected, since your code seems to create valid sample weights.
What distribution are you seeing in each batch?

Hi! Thank you for your quick response. Here is the weirdest thing to me: the distribution differs between batches. I printed the number of positive-class samples and it was essentially random, anywhere between 2 out of 10 and 8 out of 10 samples. I don't know, I am completely lost; I'm not sure what could be wrong.

Do you see the same unexpected distribution if you increase the batch size?
The weighted sampling is still a random process, so the batches won't be perfectly balanced, but they should come close to the expected weight distribution. Could you also take a look at this post, check the results for different batch sizes, and compare them to your results?