Balanced Sampling between classes with torchvision DataLoader

You were right! By increasing the batch_size, the distribution is now closer to 50/50 (sometimes more like 60/40). Phew, I'm glad I asked. Thank you!

Hi @ptrblck, thanks for answering all those questions! I was wondering whether balancing the samples within batches is actually a good idea. At first glance, it looks like it would change the negative sample distribution and potentially introduce a bias, especially in few-shot cases where the positive proportion might be << 1%. Is there a way to simply "stratify" each batch according to the labels, as we do with sklearn.model_selection.train_test_split, without changing the sample weights?

I think you could directly use the scikit-learn stratified split functionality and create the stratified indices for the splits. Once this is done, you could use these indices in a custom BatchSampler and load the entire batch of (stratified) samples in Dataset.__getitem__.
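Something along these lines might work (a rough sketch, not tested on your data; the names StratifiedBatchSampler and BatchDataset are just placeholders). It uses StratifiedKFold to build index lists that preserve the class ratio, a custom sampler that yields one list per batch, and passes batch_size=None so the whole index list reaches __getitem__:

```python
import numpy as np
import torch
from sklearn.model_selection import StratifiedKFold
from torch.utils.data import DataLoader, Dataset, Sampler


class StratifiedBatchSampler(Sampler):
    """Yields lists of indices; each list keeps roughly the global class ratio."""
    def __init__(self, labels, batch_size):
        labels = np.asarray(labels)
        n_splits = len(labels) // batch_size
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True)
        # the "test" fold of each split becomes one stratified batch
        self.batches = [test_idx.tolist()
                        for _, test_idx in skf.split(np.zeros(len(labels)), labels)]

    def __iter__(self):
        return iter(self.batches)

    def __len__(self):
        return len(self.batches)


class BatchDataset(Dataset):
    """__getitem__ receives the whole list of indices and returns a full batch."""
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __getitem__(self, indices):
        return self.data[indices], self.labels[indices]

    def __len__(self):
        return len(self.data)


# toy imbalanced data: ~10% positives
data = torch.randn(1000, 10)
labels = torch.cat([torch.zeros(900, dtype=torch.long),
                    torch.ones(100, dtype=torch.long)])[torch.randperm(1000)]

# batch_size=None disables automatic batching, so each index list yielded by
# the sampler is passed straight to Dataset.__getitem__
loader = DataLoader(
    BatchDataset(data, labels),
    sampler=StratifiedBatchSampler(labels.numpy(), batch_size=100),
    batch_size=None,
)

for x, y in loader:
    print(x.shape, y.float().mean())  # positive ratio stays ~0.1 in every batch
```

This keeps the original class distribution inside each batch instead of reweighting it, which is what the stratification question was asking for.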


@kuzand
Not necessarily.
According to the official source code of WeightedRandomSampler:
weights (sequence) : a sequence of weights, not necessary summing up to one.
You can choose to normalize the weights yourself, but it’s not necessary.
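For example (a quick sketch with made-up counts), passing raw inverse class frequencies that don't sum to one still gives roughly balanced batches, since the sampler draws via torch.multinomial, which normalizes the weights internally:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# toy imbalanced data: 900 negatives, 100 positives
data = torch.randn(1000, 10)
targets = torch.cat([torch.zeros(900, dtype=torch.long),
                     torch.ones(100, dtype=torch.long)])

class_counts = torch.bincount(targets)        # tensor([900, 100])
class_weights = 1.0 / class_counts.float()    # unnormalized inverse frequencies
sample_weights = class_weights[targets]       # one weight per sample

sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(sample_weights),
                                replacement=True)
loader = DataLoader(TensorDataset(data, targets), batch_size=100, sampler=sampler)

for x, y in loader:
    print(y.float().mean())  # roughly 0.5 per batch, no normalization needed
```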