Balanced Sampling between classes with torchvision DataLoader

I tried this and got the following error:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3296, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in
    from sampler.py import ImbalancedDatasetSampler
  File "C:\ProgramData\Anaconda3\lib\sampler.py", line 7
    ^
SyntaxError: invalid syntax

Sorry for the delay. I hope you have figured out the issue already.

It should be

from sampler import ImbalancedDatasetSampler

Thank you. I did that. I saved the script inside the torch.utils.data.sampler module and it threw a _C error. I'm not sure where the actual script file should be saved.

Jordan

This one works fine for me and is reasonably terse:

import torch
from torch.utils.data.sampler import WeightedRandomSampler
from torch.utils.data import DataLoader, TensorDataset

def class_imbalance_sampler(labels):
    # count how many samples fall into each class
    class_count = torch.bincount(labels.squeeze())
    # weight each class by its inverse frequency, then expand to per-sample weights
    class_weighting = 1. / class_count
    sample_weights = class_weighting[labels.squeeze()]
    sampler = WeightedRandomSampler(sample_weights, len(labels))
    return sampler

# samples: tensor (n x d) of type float; labels: tensor (n x 1) of type long
sampler = class_imbalance_sampler(labels)
train_ld = DataLoader(TensorDataset(samples, labels), sampler=sampler)

Is it possible to have shuffling on top of this?


Just to clarify something that seems a bit confusing in the discussion above: the num_samples argument to WeightedRandomSampler should be the size of your dataset (i.e. how many samples you want drawn per epoch), not the number of classes you have or the length of the class-weight array. This tripped me up; maybe it's helpful to someone else.
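To make that concrete, here is a minimal sketch (the variable names and sizes are made up for illustration):

import torch
from torch.utils.data import WeightedRandomSampler

# hypothetical setup: 1000 samples, 3 classes
targets = torch.randint(0, 3, (1000,))
class_counts = torch.bincount(targets)          # length = number of classes (3)
sample_weights = (1.0 / class_counts)[targets]  # length = number of samples (1000)

# num_samples is the number of samples drawn per epoch, typically len(dataset),
# not the number of classes and not len(class_counts)
sampler = WeightedRandomSampler(sample_weights, num_samples=len(targets), replacement=True)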


It helped. Thank you

Why do you need to use the weighted sampler for the test set? Isn't this supposed to be used only during training? What is the use of it at test/validation time? Also, the class_sample_counts are usually different for the train and test sets.
I don't quite get the intention here; any clarification is welcome.
Thanks a lot in advance.

weights = torch.DoubleTensor([.0009, .0001])
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights=weights, num_samples=len(dataset), replacement=True)


In my case, the sampled y_target is all 0's. Has anyone else faced this issue? I set a small batch size to track it.

The weights tensor should contain sample weights, i.e. a weight for each sample in the dataset, while it seems you might be passing class weights, i.e. a weight for each class.
Have a look at this example from this thread.


Thanks, looks like I was missing an equivalent of this part of the code:

samples_weight = np.array([weight[t] for t in y_train])
samples_weight = torch.from_numpy(samples_weight)
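For context, here is a self-contained sketch of that pattern (class weights expanded to per-sample weights; names like y_train and the label values are just placeholders):

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# hypothetical integer class labels for the training set
y_train = np.array([0, 0, 0, 0, 1, 1, 2])

class_sample_count = np.bincount(y_train)                 # samples per class
weight = 1.0 / class_sample_count                         # one weight per class
samples_weight = np.array([weight[t] for t in y_train])   # one weight per sample
samples_weight = torch.from_numpy(samples_weight)

sampler = WeightedRandomSampler(samples_weight, num_samples=len(samples_weight), replacement=True)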

Thanks for the code.
I have one doubt that I have been stuck on for a long time; it would be of much help if anyone could help me solve it.
My dataset has 2000 classes in a long-tail distribution, so I am using WeightedRandomSampler, and this code is of much help. I am using batch size = 32.
My doubt is whether I should use replacement=True or False. If I want a balanced distribution but keep replacement=False, then after iterating through the whole dataset my model will not have been trained on balanced data, since in that case there is no oversampling or duplication.
If I use replacement=True instead, would that solve the problem of getting a balanced dataset over the whole iteration of the data loader, and would the batches still contain instances from all classes, given that batch_size < number of classes in this case?

Thanks
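For anyone who wants to check this empirically, here is a minimal standalone sketch with made-up class counts (not the 2000-class dataset above) showing the effect of replacement:

import torch
from torch.utils.data import WeightedRandomSampler

# hypothetical long-tail toy labels: class 0 is very frequent, class 2 is rare
labels = torch.tensor([0] * 90 + [1] * 8 + [2] * 2)
sample_weights = (1.0 / torch.bincount(labels).float())[labels]

# with replacement=True the rare classes are oversampled, so the drawn epoch is
# roughly balanced; with replacement=False each index can be drawn at most once,
# so no oversampling is possible and the epoch stays imbalanced
for replacement in (True, False):
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=replacement)
    drawn = labels[list(sampler)]
    print(replacement, torch.bincount(drawn))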

Double post from here. Let’s stick to the newly created topic for further discussion.

@ptrblck
I saw two different implementations of the sampler throughout this post and summarized them below. Do they make any practical difference? Thanks!

(1) This sampler passes class weights and uses batch_size as num_samples:

weights = 1 / torch.Tensor(class_sample_count)
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, batch_size)

(2) This sampler passes per-sample weights and uses the size of train_dataset as num_samples:

sampler = torch.utils.data.sampler.WeightedRandomSampler(samples_weights, len(samples_weights))
train_loader = torch.utils.data.DataLoader(dataset_train, batch_size=args.batch_size, shuffle=True,
                                           sampler=sampler)

The second approach would be right as explained here.
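One caveat worth noting: DataLoader does not accept shuffle=True together with a sampler (the sampler already randomizes the order), so a working version of the second approach would look roughly like this (samples_weights, dataset_train, and args.batch_size are the placeholders from the post above):

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# samples_weights: one weight per training sample, as in approach (2)
sampler = WeightedRandomSampler(samples_weights, num_samples=len(samples_weights), replacement=True)

# note: shuffle must be omitted (or False) when a sampler is passed,
# otherwise DataLoader raises an error; the sampler already shuffles
train_loader = DataLoader(dataset_train, batch_size=args.batch_size, sampler=sampler)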

@ptrblck

Hi, I am working on text classification and my dataset contains imbalanced labels:
classcount = np.bincount(train_data['label']).tolist()
train_weights = 1. / torch.tensor(classcount, dtype=torch.float)
train__sampleweights = train_weights[train_data['label']]

# sampler built from the per-sample weights above
weighted_sampler = WeightedRandomSampler(
    weights=train__sampleweights,
    num_samples=len(train__sampleweights),
    replacement=True)

but the minority labels are still not predicted well on the validation set.
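For completeness, the sampler above would typically be passed to the training DataLoader along these lines (train_dataset here is a placeholder name):

from torch.utils.data import DataLoader

# the weighted sampler only rebalances what the training loader draws;
# the validation set keeps its natural (imbalanced) label distribution
train_loader = DataLoader(train_dataset,
                          batch_size=32,
                          sampler=weighted_sampler)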

Hi! I am new to PyTorch, so I am sorry if this question is too basic. I have an imbalanced dataset (the class priors are 0.77 and 0.23 for class 0 and class 1, respectively). Following some posts here, I am using the DataLoader with a sampler specified by

class_sample_count = np.zeros(len(classes), dtype=int)
for n, cl in enumerate(classes):
    class_sample_count[n] = sum(train_df.label == cl)

weights = 1 / class_sample_count
target = train_df.label.to_numpy()
samples_weight = weights[target]
samples_weight = torch.tensor(samples_weight, dtype=torch.float)
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, len(weights))

So, as @ptrblck mentions here, the batches are not balanced. I think I have two options here:

  1. compute the loss with class weights, or
  2. figure out how to ensure that using the sampler gives me balanced batches.

If anyone has already dealt with this, I would much appreciate any help, since
for 1) I am not sure whether the weights I should use for the loss are the original class priors or the class distribution produced by the sampler (i.e., the number of positive and negative samples within each batch), and for 2) I tried creating a samples_weight with equal values for all samples, but it didn't change the imbalance problem within the batches.
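For what it's worth, option 1 usually means weighting the loss by inverse class frequencies; here is a minimal sketch using the priors mentioned above (0.77 / 0.23), purely as an illustration:

import torch
import torch.nn as nn

# inverse-frequency class weights for priors 0.77 (class 0) and 0.23 (class 1)
class_priors = torch.tensor([0.77, 0.23])
class_weights = 1.0 / class_priors

criterion = nn.CrossEntropyLoss(weight=class_weights)

# hypothetical logits for a batch of 4 samples and their targets
logits = torch.randn(4, 2)
targets = torch.tensor([0, 1, 1, 0])
loss = criterion(logits, targets)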

Thank you!!

That's rather unexpected, since your code seems to create valid sample weights.
What distribution are you seeing in each batch?

Hi! Thank you for your quick response. Here is the weirdest thing to me: the distribution differs between batches. I printed the number of positive-class samples and it was essentially random, anywhere between 2 out of 10 and 8 out of 10 samples. I don't know, I am completely lost; I'm not sure what could be wrong.

Do you see the same unexpected distribution if you increase the batch size?
The weighted sampling is still a random process, so the batches won't be perfectly balanced, but they should come close to the expected weight distribution. Could you also take a look at this post, check the results for different batch sizes, and compare them to your results?