How to use WeightedSampler along with a custom Sampler?

Achu_Chandran · September 24, 2020, 3:23pm

I am using a custom sampler class similar to the one described in "How upload sequence of image on video-classification", suggested by @ptrblck (https://discuss.pytorch.org/u/ptrblck), to stack up images sequentially from the datasets before feeding them into the network.

Now I also want to use WeightedRandomSampler to balance the classes.

Any ideas on how to use a WeightedRandomSampler along with a custom sampler class?

Thanks in advance !!

tom · September 24, 2020, 8:04pm

I’d probably take @ptrblck’s sampler and

- compute the class weights in __init__,
- have a tensor that contains the class weight for each index,
- use torch.multinomial with the weights instead of randperm.

For the latter, you can compare SubsetRandomSampler

    def __iter__(self):
        return (self.indices[i] for i in torch.randperm(len(self.indices), generator=self.generator))

to WeightedRandomSampler

    def __iter__(self):
        rand_tensor = torch.multinomial(self.weights, self.num_samples, self.replacement, generator=self.generator)
        return iter(rand_tensor.tolist())

(except you want the indices here, too).

Note that you likely want replacement=True.
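The three steps above could be sketched roughly as follows. This is only a minimal illustration, not @ptrblck's actual sampler: the class name `WeightedIndexSampler`, its arguments, and the inverse-frequency weighting are assumptions, and it assumes you already have one integer label per entry of `indices`.

```python
import torch
from torch.utils.data import Sampler


class WeightedIndexSampler(Sampler):
    """Hypothetical sampler: yields entries of `indices`, drawn with
    probability proportional to the weight of each entry's class."""

    def __init__(self, indices, labels, num_samples=None,
                 replacement=True, generator=None):
        # Assumption: labels[k] is the integer class of indices[k].
        self.indices = list(indices)
        labels = torch.as_tensor(labels, dtype=torch.long)

        # Step 1: class weights, here simply the inverse class frequency.
        class_counts = torch.bincount(labels)
        class_weights = 1.0 / class_counts.clamp(min=1).double()

        # Step 2: one weight per index, looked up from its class.
        self.weights = class_weights[labels]

        self.num_samples = len(self.indices) if num_samples is None else num_samples
        self.replacement = replacement
        self.generator = generator

    def __iter__(self):
        # Step 3: torch.multinomial with the weights instead of randperm.
        positions = torch.multinomial(
            self.weights, self.num_samples, self.replacement,
            generator=self.generator,
        )
        # Map the drawn positions back to the stored indices.
        return (self.indices[i] for i in positions.tolist())

    def __len__(self):
        return self.num_samples
```

With `replacement=True`, indices of rare classes can be drawn several times per epoch, which is what balances the classes.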

Best regards

Thomas

Achu_Chandran · September 25, 2020, 9:40am

Hello @tom, thanks for your response.

To compute the weights, I need the count for each class. In my case, however, the sampler is also used to create the datasets, so I can only get access to the sample counts after the datasets have been created.
(I use the sampler to get the indices of the images and stack up the images sequentially to create samples, then create the datasets and load them using the DataLoader.)

Do you have any suggestion for this? I can attach the code if required.

Thanks again

tom · September 25, 2020, 12:24pm

I don’t think there is anything wrong per se with looping over the dataset to get the class distribution. That said, if it takes a long time and you expect to run your training often, the typical thing is to make it a preprocessing step (just like e.g. the famous ImageNet mean and std for normalization have been part of preprocessing before people just kept them hardcoded).
Creating a DataLoader isn’t that expensive (the expensive work is only done when iterating over it and is redone every epoch), so there isn’t anything wrong with having one that is used to collect statistics and then creating a new one with the weighted sampler.
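As a rough sketch of that two-loader idea: the toy `TensorDataset`, the batch size, and the inverse-frequency weights below are placeholders I made up for illustration, so substitute your own dataset and weighting.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy stand-in for the real dataset: 8 samples with imbalanced labels (6 vs. 2).
data = torch.randn(8, 3)
targets = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])
dataset = TensorDataset(data, targets)

# First loader: used only to loop over the dataset once and collect the labels.
stats_loader = DataLoader(dataset, batch_size=4, shuffle=False)
all_labels = torch.cat([y for _, y in stats_loader])

# Per-sample weight = inverse frequency of that sample's class.
class_counts = torch.bincount(all_labels)
sample_weights = (1.0 / class_counts.double())[all_labels]

# Second loader: same dataset, now drawn through the weighted sampler.
sampler = WeightedRandomSampler(
    sample_weights, num_samples=len(sample_weights), replacement=True
)
train_loader = DataLoader(dataset, batch_size=4, sampler=sampler)
```

If the statistics pass is slow, `class_counts` could be saved to disk once and reloaded in later runs instead of being recomputed.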

Best regards

Thomas

Achu_Chandran · September 25, 2020, 12:50pm

Thanks. If I understood correctly, you mean I can have one sampler for preparing the datasets and a weighted sampler for iterating over them?

tom · September 25, 2020, 1:43pm

Yes, you can use the same Dataset but wrap it in different DataLoaders (with different samplers).

Achu_Chandran · September 25, 2020, 1:52pm

Thank you, @tom.

This helps a lot.