I have a very large dataset, and I would like to use a different random subset of 1000 samples for each epoch. Is there any way to do this with Dataset and DataLoader?
I would like something like torch.utils.data.RandomSampler with a num_samples limit, but drawing without replacement.
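Conceptually this is what I'm after (a minimal sketch; I'm not sure my PyTorch version accepts num_samples together with replacement=False, newer versions do but older ones raise an error for that combination):

import torch
from torch.utils.data import DataLoader, RandomSampler

# Sketch: 1000 distinct indices per epoch, assuming the installed PyTorch
# allows num_samples combined with replacement=False.
sampler = RandomSampler(train_dataset, replacement=False, num_samples=1000)
train_loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)

Here is my attempt with SubsetRandomSampler: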
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler

train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=False,
    num_workers=1,
    pin_memory=True,
    drop_last=True,
    # torch.randint draws indices with replacement, so the 1000 indices can
    # contain duplicates; they are also drawn only once here, so every epoch
    # reuses the same subset (SubsetRandomSampler just reshuffles their order).
    sampler=SubsetRandomSampler(
        torch.randint(high=len(train_dataset), size=(1000,))
    ),
)
Edit: What I really want is a maximum number of samples per epoch. Keeping all samples in the dataset and randomly selecting 1000 of them each epoch would be fine.
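For example, a per-epoch loop like this would do it (a minimal sketch with the same imports as above; num_epochs is just a placeholder):

for epoch in range(num_epochs):
    # Re-draw 1000 distinct indices before each epoch: torch.randperm samples
    # without replacement, and SubsetRandomSampler then shuffles their order.
    indices = torch.randperm(len(train_dataset))[:1000]
    train_loader = DataLoader(
        train_dataset,
        batch_size=32,
        sampler=SubsetRandomSampler(indices),
    )
    for batch in train_loader:
        ...  # training step

Rebuilding the DataLoader every epoch feels clumsy, though.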
Edit2: I came up with the following:
import torch
from torch.utils.data import Sampler

class RandomSampler(Sampler):
    def __init__(self, data_source, num_samples=None):
        self.data_source = data_source
        self._num_samples = num_samples
        if not isinstance(self.num_samples, int) or self.num_samples <= 0:
            raise ValueError(
                "num_samples should be a positive integer "
                "value, but got num_samples={}".format(self.num_samples)
            )

    @property
    def num_samples(self):
        # dataset size might change at runtime
        if self._num_samples is None:
            return len(self.data_source)
        return self._num_samples

    def __iter__(self):
        n = len(self.data_source)
        # A fresh permutation on every call, truncated to num_samples: no index
        # repeats (without replacement), and a new random subset per epoch.
        return iter(torch.randperm(n, dtype=torch.int64)[: self.num_samples].tolist())

    def __len__(self):
        return self.num_samples
I adapted the default RandomSampler so that it can draw num_samples indices without replacement, but I don't know if this is the correct solution.
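For reference, this is how I plug it in (num_epochs is a placeholder). Since __iter__ draws a fresh permutation every time the loader is iterated, each epoch should see a different 1000-sample subset with no repeated indices:

train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    num_workers=1,
    pin_memory=True,
    drop_last=True,
    sampler=RandomSampler(train_dataset, num_samples=1000),
)

for epoch in range(num_epochs):
    for batch in train_loader:
        ...  # train on this epoch's random subset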