Weighted Random Sampling with unique samples for each mini-batch

Hey,

I’ve been using WeightedRandomSampler but due to the stochastic nature of the process a mini-batch can oftentimes contain the same instance twice. Is there a way to guarantee that all instances are unique within a mini-batch while maintaining the other properties of the sampler? Note that since I am trying to oversample one of the two classes, I am using sampling with replacement. Thank you!
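
For illustration, a toy setup along these lines (the class sizes and weights are made up) shows the kind of duplicates I mean:

import torch
from torch.utils.data import WeightedRandomSampler

# Toy imbalanced setup: the 2 minority samples carry most of the weight.
sample_weights = torch.tensor([0.9] * 2 + [0.1] * 8, dtype=torch.double)
sampler = WeightedRandomSampler(sample_weights, num_samples=10, replacement=True)

indices = list(sampler)
# With replacement=True, a mini-batch of e.g. 4 consecutive indices can
# easily contain the same index more than once.
print(indices[:4])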

Neo

I think you could write a custom sampler, which could use the WeightedRandomSampler implementation as the base and create the indices without duplicates within each batch, using the batch size.
What kind of issues are you seeing with these duplicates?

Hey @ptrblck, thanks for the reply.

The issue is technical: I am using a stateful dataset where each instance stores information that is processed at each batch. Hence, when the same instance appears twice in a mini-batch, all sorts of errors arise that I’d rather solve via the sampler than by modifying my current implementation.

I’ve implemented a sampler that seems to work for now but is probably not that optimal. Any thoughts on it?

from torch.utils.data import WeightedRandomSampler
from typing import Sequence
import torch

class CustomWeightedSampler(WeightedRandomSampler):
    def __init__(self, weights: Sequence[float], num_samples: int, bs: int,
                 generator=None) -> None:
        if not isinstance(num_samples, int) or isinstance(num_samples, bool) or \
                num_samples <= 0:
            raise ValueError("num_samples should be a positive integer "
                             "value, but got num_samples={}".format(num_samples))
        if not isinstance(bs, int) or isinstance(bs, bool) or bs <= 0:
            raise ValueError("bs should be a positive integer "
                             "value, but got bs={}".format(bs))
        if bs > num_samples:
            raise ValueError("bs should not be greater than num_samples, "
                             "but got bs={} and num_samples={}".format(bs, num_samples))
        # torch.multinomial expects a tensor, so convert the weights here
        # (mirrors what WeightedRandomSampler.__init__ does).
        self.weights = torch.as_tensor(weights, dtype=torch.double)
        self.num_samples = num_samples
        self.bs = bs
        self.generator = generator

    def __iter__(self):
        # Draw each mini-batch of indices without replacement, so a single
        # batch never contains duplicates; indices can still repeat across
        # batches, which keeps the oversampling behaviour.
        rand_tensor = torch.multinomial(self.weights, self.bs, False,
                                        generator=self.generator)
        for _ in range((self.num_samples - self.bs) // self.bs):
            rand_tensor = torch.cat([rand_tensor,
                                     torch.multinomial(self.weights, self.bs, False,
                                                       generator=self.generator)])
        # Last (possibly smaller) batch if num_samples is not a multiple of bs.
        if self.num_samples % self.bs != 0:
            rand_tensor = torch.cat([rand_tensor,
                                     torch.multinomial(self.weights,
                                                       self.num_samples % self.bs,
                                                       False, generator=self.generator)])
        return iter(rand_tensor.tolist())

    def __len__(self):
        return super(CustomWeightedSampler, self).__len__()
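
For reference, a sketch of how this could be plugged into a DataLoader (the dataset and weights below are just placeholders, not my actual setup):

from torch.utils.data import DataLoader, TensorDataset

# Placeholder imbalanced dataset: 90 samples of class 0, 10 of class 1.
data = torch.randn(100, 5)
targets = torch.cat([torch.zeros(90), torch.ones(10)]).long()
dataset = TensorDataset(data, targets)

# Give each minority-class sample a proportionally larger weight.
class_weights = torch.tensor([1.0, 9.0], dtype=torch.double)
sample_weights = class_weights[targets]

# bs must match the DataLoader's batch_size for the per-batch uniqueness to hold.
sampler = CustomWeightedSampler(sample_weights, num_samples=len(dataset), bs=10)
loader = DataLoader(dataset, batch_size=10, sampler=sampler)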

I don’t see where you are checking for duplicated indices in the code and think that:

rand_tensor = torch.multinomial(self.weights, self.bs, False, generator=self.generator)
for _ in range((self.num_samples - self.bs) // self.bs):
    rand_tensor = torch.cat([rand_tensor,
                             torch.multinomial(self.weights, self.bs, False, generator=self.generator)])

could sample repeated values.
While torch.multinomial wouldn’t use replacement, you are still passing the same weights, and thus the sampled values might already be in rand_tensor, no?

Hey, sorry for my late reply.

Correct me if I am wrong, but I assumed that __iter__() returns the order of the indices that will be sampled during data loading. In other words, given a list [0, 1, 2, 3] and a batch size of 2, the indices of the two mini-batches would be [0, 1] and [2, 3].

Given that the above is correct, to solve my initial issue I simply sample each mini-batch without replacement, independently of the others. In theory this translates to sampling with replacement across multiple mini-batches while maintaining unique instances within each mini-batch. Let me know if I am missing something!
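
A quick way to sanity-check this (with made-up weights) is to split the sampled indices into bs-sized chunks, which is how the DataLoader builds the mini-batches, and confirm that no chunk contains duplicates:

import torch

weights = torch.rand(50, dtype=torch.double)
sampler = CustomWeightedSampler(weights, num_samples=50, bs=8)

indices = list(sampler)
chunks = [indices[i:i + 8] for i in range(0, len(indices), 8)]
# Every chunk (i.e. future mini-batch) holds unique indices, while the same
# index may still appear across chunks, so the oversampling is preserved.
assert all(len(chunk) == len(set(chunk)) for chunk in chunks)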

Yes, __iter__ will return an iterator, which will return the indices sampled from rand_tensor.
And yes, you are right. rand_tensor uses the batch size as the “stride” and doesn’t use replacement, so you should be fine. :slight_smile:
