File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3296, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in
from sampler.py import ImbalancedDatasetSampler
File "C:\ProgramData\Anaconda3\lib\sampler.py", line 7
Thank you. I did that. I saved the script inside the torch.utils.data.sampler script and it seemed to throw a _C error. I’m not sure where to save the actual script file.
Just to clarify something that seems a bit confusing in the above discussions: the num_samples argument to WeightedRandomSampler should be the size of your dataset, not the number of dataset classes you have (or length of sampling weights array, as represented above). This tripped me up, maybe helpful to someone else.
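To illustrate the point about num_samples, here is a minimal sketch on made-up toy labels (the 8/2 split and all variable names are assumptions for illustration, not anyone's actual data). It shows that num_samples is simply the number of indices the sampler draws per pass:

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Toy labels (assumed): 8 samples of class 0 and 2 samples of class 1.
labels = torch.tensor([0] * 8 + [1] * 2)
class_counts = torch.bincount(labels)                  # samples per class
sample_weights = 1.0 / class_counts[labels].float()    # one weight per sample

# num_samples is how many indices are drawn per pass, here the dataset size.
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
drawn = list(sampler)
print(len(drawn), drawn)
```

If num_samples were set to the number of classes (2 here), one pass over the DataLoader would only yield 2 samples.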
Why do you need to use the weighted sampler for the test set? Isn’t it supposed to be used only in training? What is the use of it at test/validation time? Also, class_sample_counts is different for the train and test sets most of the time!
I don’t quite get the intention here; any clarification is welcome.
Thanks a lot in advance.
The weights tensor should contain sample weights, i.e. a weight for each sample in the dataset, while it seems you might pass class weights, i.e. a weight for each class.
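A small sketch of the difference, using hypothetical targets (6 samples of class 0, 2 of class 1): the class weights have one entry per class, and indexing them with the targets expands them to one entry per sample. With inverse-frequency weights, each class then carries the same total sampling mass.

```python
import torch

# Hypothetical targets: 6 samples of class 0, 2 samples of class 1.
labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])

class_counts = torch.bincount(labels).float()
class_weights = 1.0 / class_counts          # one weight per class, shape [2]
sample_weights = class_weights[labels]      # one weight per sample, shape [8]

# Total sampling mass per class; equal values mean each class is equally likely to be drawn.
mass_per_class = [sample_weights[labels == c].sum().item() for c in range(2)]
print(class_weights.shape, sample_weights.shape, mass_per_class)
```

It is sample_weights, not class_weights, that would be passed to WeightedRandomSampler.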
Have a look at this example from this thread.
Thanks for the code.
I have one doubt that I have been stuck on for a long time; it would be a great help if anyone could help me resolve it.
My dataset has 2000 classes in a long-tail distribution, so I am using WeightedRandomSampler, and this code is very helpful. I am using batch size = 32.
My doubt is whether I should use replacement = True or False if I want a balanced distribution. If I keep replacement = False and iterate over the loader completely, then overall my model will not be trained on balanced data, as in this case there would be no oversampling or duplicates.
If I use replacement = True, would that solve the problem of getting a balanced dataset over the whole iteration of the data loader, and would it include instances from all classes, given that batch_size < number of classes in this case?
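As a rough sketch of how the two settings behave, here is a two-class toy example (the 900/100 split is an assumption for illustration, not the 2000-class dataset from the question). It counts which classes are drawn in one full pass with num_samples equal to the dataset size:

```python
import torch
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)

# Toy imbalanced labels (assumed): 900 samples of class 0, 100 of class 1.
labels = torch.tensor([0] * 900 + [1] * 100)
sample_weights = (1.0 / torch.bincount(labels).float())[labels]

def drawn_class_counts(replacement):
    """Draw len(labels) indices and count how often each class appears."""
    sampler = WeightedRandomSampler(
        sample_weights, num_samples=len(labels), replacement=replacement
    )
    idx = torch.tensor(list(sampler))
    return torch.bincount(labels[idx], minlength=2).tolist()

without_repl = drawn_class_counts(False)  # each index appears exactly once
with_repl = drawn_class_counts(True)      # minority samples are repeated
print(without_repl, with_repl)
```

With replacement=False and num_samples equal to the dataset size, every index is drawn exactly once, so the full pass has the original class distribution; only the order is affected by the weights. With replacement=True the drawn classes are balanced in expectation. A batch of 32 cannot contain all 2000 classes, so in that setting balance would only hold on average over many batches, not within each batch.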
@ptrblck
I saw two different implementations of the sampler throughout this post. I summarized them below. Do they make any practical difference? Thanks!
Hi, I am working on text classification, and my dataset contains imbalanced labels.
classcount = np.bincount(train_data['label']).tolist()
train_weights = 1./torch.tensor(classcount, dtype=torch.float)
train__sampleweights = train_weights[train_data['label']]
Hi! I am new to PyTorch, so I am sorry if this question is too basic. I have an imbalanced dataset (whose priors are 0.77 and 0.23 for class 0 and 1, respectively). Following some posts here, I am using the DataLoader with a sampler specified by
class_sample_count = np.zeros(len(classes,), dtype=int)
for n, cl in enumerate(classes):
    class_sample_count[n] = sum(train_df.label==cl)
weights = (1 / class_sample_count)
target = train_df.label.to_numpy()
samples_weight = weights[target]
samples_weight = torch.tensor(samples_weight, dtype=torch.float)
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, len(weights))
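Note that the last line of the snippet above passes the per-class weights and len(weights) to the sampler, while samples_weight is computed but never used. A sketch of what was probably intended is below; the target array and features are stand-ins for train_df (assumed toy data with a 0.77/0.23 split), and the labels are assumed to be the integers 0 and 1.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Stand-in for train_df.label (assumed): 770 negatives and 230 positives.
target = np.array([0] * 770 + [1] * 230)
features = torch.randn(len(target), 4)  # hypothetical features

class_sample_count = np.bincount(target)   # samples per class
weights = 1.0 / class_sample_count         # per-class weights
samples_weight = torch.tensor(weights[target], dtype=torch.float)  # per-sample weights

# Pass the per-sample weights and their length, not the per-class ones.
sampler = WeightedRandomSampler(samples_weight, len(samples_weight))
dataset = TensorDataset(features, torch.from_numpy(target))
loader = DataLoader(dataset, batch_size=10, sampler=sampler)

# Fraction of positives over one pass; close to 0.5 if the sampler rebalances.
pos_fraction = torch.cat([y for _, y in loader]).float().mean().item()
print(pos_fraction)
```

Over a full pass the positive fraction should come out near 0.5, although individual batches of 10 will still vary.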
So, as @ptrblck mentions here, the batches are not balanced. I think I have two options:
1) compute the loss with the class weights
2) discover how to ensure that I get balanced batches by using the sampler.
If anyone has already dealt with this, I would much appreciate any help, since for 1) I am not sure whether the weights I should use for the loss are the original class priors or the class distribution produced by the sampler (i.e., the number of positive and negative samples within that batch), and for 2) I tried creating a samples_weight with equal values for all samples, but that didn’t change the imbalance problem within the batch calls.
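For option 1), one common choice (an assumption here, not something confirmed in this thread) is to leave the sampling untouched and weight the loss by the inverse of the original class priors. A minimal sketch with random logits standing in for model outputs:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Original class priors from the question; inverse-prior weighting is an assumed choice.
priors = torch.tensor([0.77, 0.23])
class_weights = 1.0 / priors
class_weights = class_weights / class_weights.sum()  # optional normalization

# The weight argument of CrossEntropyLoss expects one weight per class.
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(10, 2)             # hypothetical model outputs for a batch of 10
targets = torch.randint(0, 2, (10,))    # hypothetical labels
loss = criterion(logits, targets)
print(class_weights, loss.item())
```

If a rebalancing sampler is already in use, adding these loss weights on top would compensate for the imbalance twice, so the two options are usually treated as alternatives.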
Hi! Thank you for your quick response. Here is the weirdest thing to me: the distribution is not the same between batches. I printed the number of positive-class samples and it was absolutely random, anywhere between 2 out of 10 samples and maybe 8 out of 10 samples. I don’t know, I am completely lost, and not sure what could be wrong.
Do you see the same unexpected distribution, if you increase the batch size?
The weighted sampling is still a random process, so the batches won’t be perfectly balanced, but should come close to the expected weight distribution. Could you also take a look at this post and check the results for different batch sizes and compare them to your results?
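One way to run that check is sketched below on toy labels with the 0.77/0.23 split from the question (the helper name and the number of batches are assumptions). It draws many batches at several batch sizes and reports the mean and spread of the positive fraction per batch:

```python
import torch
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)

# Toy labels (assumed): 770 samples of class 0, 230 of class 1.
labels = torch.tensor([0] * 770 + [1] * 230)
sample_weights = (1.0 / torch.bincount(labels).float())[labels]

def positive_fraction_per_batch(batch_size, num_batches=200):
    """Draw num_batches batches and return the fraction of positives in each."""
    sampler = WeightedRandomSampler(
        sample_weights, num_samples=batch_size * num_batches, replacement=True
    )
    drawn = labels[torch.tensor(list(sampler))].float()
    return drawn.view(num_batches, batch_size).mean(dim=1)

spread = {}
for bs in (10, 100, 1000):
    frac = positive_fraction_per_batch(bs)
    spread[bs] = (frac.mean().item(), frac.std().item())
    print(bs, spread[bs])
```

The mean should stay near 0.5 for every batch size, while the spread shrinks as the batch size grows. With a batch size of 10 the standard deviation is roughly 0.16, so seeing anywhere from 2 to 8 positives out of 10 is not unusual.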