Balanced Sampling between classes with torchvision DataLoader

You were right! By increasing the batch_size, the distribution is now closer to 50/50 (sometimes more like 60/40). Phew, I'm glad I asked. Thank you!

Hi @ptrblck, thanks for answering all those questions! I was wondering whether balancing the samples within batches is actually a good idea. At first glance, it looks like it would change the negative sample distribution and potentially introduce a bias, especially in few-shot cases where the positive proportion might be << 1%. Is there a way to simply "stratify" each batch according to the labels, as we do with sklearn.model_selection.train_test_split, without changing the sample weights?

I think you could directly use the scikit-learn stratified split functionality and create the stratified indices for the splits. Once this is done, you could use these indices in a custom BatchSampler and load the entire batch of (stratified) samples in Dataset.__getitem__.
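Something along these lines might work (a rough sketch, not tested on your data; the names StratifiedBatchSampler and BatchDataset are just placeholders). It uses StratifiedKFold to build index lists that preserve the class ratio, a custom sampler that yields one list per batch, and passes batch_size=None so the whole index list reaches __getitem__:

```python
import numpy as np
import torch
from sklearn.model_selection import StratifiedKFold
from torch.utils.data import DataLoader, Dataset, Sampler


class StratifiedBatchSampler(Sampler):
    """Yields lists of indices; each list keeps roughly the global class ratio."""
    def __init__(self, labels, batch_size):
        labels = np.asarray(labels)
        n_splits = len(labels) // batch_size
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True)
        # the "test" fold of each split becomes one stratified batch
        self.batches = [test_idx.tolist()
                        for _, test_idx in skf.split(np.zeros(len(labels)), labels)]

    def __iter__(self):
        return iter(self.batches)

    def __len__(self):
        return len(self.batches)


class BatchDataset(Dataset):
    """__getitem__ receives the whole list of indices and returns a full batch."""
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __getitem__(self, indices):
        return self.data[indices], self.labels[indices]

    def __len__(self):
        return len(self.data)


# toy imbalanced data: ~10% positives
data = torch.randn(1000, 10)
labels = torch.cat([torch.zeros(900, dtype=torch.long),
                    torch.ones(100, dtype=torch.long)])[torch.randperm(1000)]

# batch_size=None disables automatic batching, so each index list yielded by
# the sampler is passed straight to Dataset.__getitem__
loader = DataLoader(
    BatchDataset(data, labels),
    sampler=StratifiedBatchSampler(labels.numpy(), batch_size=100),
    batch_size=None,
)

for x, y in loader:
    print(x.shape, y.float().mean())  # positive ratio stays ~0.1 in every batch
```

This keeps the original class distribution inside each batch instead of reweighting it, which is what the stratification question was asking for.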


@kuzand
Not necessarily.
According to the official source code of WeightedRandomSampler:
weights (sequence) : a sequence of weights, not necessary summing up to one.
You can choose to normalize the weights yourself, but it’s not necessary.
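For example (a quick sketch with made-up counts), passing raw inverse class frequencies that don't sum to one still gives roughly balanced batches, since the sampler draws via torch.multinomial, which normalizes the weights internally:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# toy imbalanced data: 900 negatives, 100 positives
data = torch.randn(1000, 10)
targets = torch.cat([torch.zeros(900, dtype=torch.long),
                     torch.ones(100, dtype=torch.long)])

class_counts = torch.bincount(targets)        # tensor([900, 100])
class_weights = 1.0 / class_counts.float()    # unnormalized inverse frequencies
sample_weights = class_weights[targets]       # one weight per sample

sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(sample_weights),
                                replacement=True)
loader = DataLoader(TensorDataset(data, targets), batch_size=100, sampler=sampler)

for x, y in loader:
    print(y.float().mean())  # roughly 0.5 per batch, no normalization needed
```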