I’ve been using WeightedRandomSampler, but due to the stochastic nature of the process a mini-batch can often contain the same instance twice. Is there a way to guarantee that all instances are unique within a mini-batch while maintaining the other properties of the sampler? Note that since I am trying to oversample one of two classes, I am using sampling with replacement. Thank you!
I think you could write a custom sampler, which could use the WeightedRandomSampler implementation as the base and create non-duplicate indices using the batch size.
What kind of issues are you seeing with these duplicates?
The issue is technical: I am using a stateful dataset where each instance has information stored and processed at each batch. When the same instance appears twice in a mini-batch, all sorts of errors arise, which I’d rather solve via the sampler than by modifying my current implementation.
I’ve implemented a sampler that seems to work for now but is probably not that optimal. Any thoughts on it?
from typing import Sequence

import torch
from torch.utils.data import WeightedRandomSampler


class CustomWeightedSampler(WeightedRandomSampler):
    def __init__(self, weights: Sequence[float], num_samples: int, bs: int,
                 generator=None) -> None:
        if not isinstance(num_samples, int) or isinstance(num_samples, bool) or \
                num_samples <= 0:
            raise ValueError("num_samples should be a positive integer "
                             "value, but got num_samples={}".format(num_samples))
        if not isinstance(bs, int) or isinstance(bs, bool) or bs <= 0:
            raise ValueError("bs should be a positive integer "
                             "value, but got bs={}".format(bs))
        if bs > num_samples:
            raise ValueError("bs should be smaller than num_samples "
                             "but got bs={} and num_samples={}".format(bs, num_samples))
        # convert to a tensor here, since torch.multinomial expects one
        self.weights = torch.as_tensor(weights, dtype=torch.double)
        self.num_samples = num_samples
        self.bs = bs
        self.generator = generator

    def __iter__(self):
        # draw one mini-batch at a time without replacement, so indices
        # are unique within each mini-batch but may repeat across batches
        rand_tensor = torch.multinomial(self.weights, self.bs, False,
                                        generator=self.generator)
        for _ in range((self.num_samples - self.bs) // self.bs):
            rand_tensor = torch.cat([rand_tensor,
                torch.multinomial(self.weights, self.bs, False,
                                  generator=self.generator)])
        if self.num_samples % self.bs != 0:
            rand_tensor = torch.cat([rand_tensor,
                torch.multinomial(self.weights, self.num_samples % self.bs,
                                  False, generator=self.generator)])
        return iter(rand_tensor.tolist())

    def __len__(self):
        return super(CustomWeightedSampler, self).__len__()
This could still sample repeated values. While torch.multinomial wouldn’t use replacement, you are still passing the same weights in each call, and thus the newly sampled values might already be in rand_tensor, no?
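To illustrate the point: a toy sketch (the weight values here are made up) showing that replacement=False only guarantees uniqueness within a single torch.multinomial call, while a second call draws from the full weight vector again.

```python
import torch

# uniform weights over 4 items, seeded generator for reproducibility
weights = torch.tensor([1.0, 1.0, 1.0, 1.0])
g = torch.Generator().manual_seed(0)

# sampling all 4 items without replacement yields distinct indices
a = torch.multinomial(weights, 4, replacement=False, generator=g)

# a second call samples from the same full pool, so indices
# from the first call reappear
b = torch.multinomial(weights, 4, replacement=False, generator=g)

assert len(set(a.tolist())) == 4           # no duplicates within one call
assert set(b.tolist()) == set(a.tolist())  # same pool reused across calls
```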
Correct me if I am wrong, but I assumed that __iter__() returns the order of indices that will be sampled during data loading. In other words, given a list [0, 1, 2, 3] and a batch size of 2, the indices of the two mini-batches would be [0, 1] and [2, 3].
Given that the above is correct, to solve my initial issue I simply sample without replacement within each mini-batch independently. In effect this translates to sampling with replacement across mini-batches while maintaining unique instances within each mini-batch. Let me know if I am missing something!
Yes, __iter__ will return an iterator, which will return the indices sampled from rand_tensor.
And yes, you are right. rand_tensor uses the batch size as the “stride” and doesn’t use replacement, so you should be fine.
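For anyone wanting to double-check this, here is a minimal self-contained sketch (the weights and sizes are illustrative) that mimics the sampler’s __iter__ logic, one multinomial draw per mini-batch, and verifies that indices are unique within each batch.

```python
import torch

# made-up weights: oversample the second half of a 16-item dataset
weights = torch.tensor([0.1] * 8 + [0.9] * 8, dtype=torch.double)
num_samples, bs = 16, 4

# mimic the sampler: one without-replacement draw per mini-batch,
# concatenated into a single index stream
indices = torch.cat([
    torch.multinomial(weights, bs, replacement=False)
    for _ in range(num_samples // bs)
]).tolist()

# split the stream back into mini-batches, as a DataLoader with
# batch_size=bs and shuffle=False would consume it
batches = [indices[i:i + bs] for i in range(0, len(indices), bs)]
for batch in batches:
    assert len(set(batch)) == len(batch)  # unique within each mini-batch
```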