New subset every epoch

I have a very big dataset, and I would like to use a different random subset of 1000 samples for each epoch. Is there any way I can do it using Dataset and DataLoader?
I would like something like torch.utils.data.RandomSampler, but without replacement.

train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=False,  # must stay False when a sampler is passed
    num_workers=1,
    pin_memory=True,
    drop_last=True,
    sampler=SubsetRandomSampler(
        # torch.randint draws with replacement, so indices may repeat
        torch.randint(high=len(train_dataset), size=(1000,))
    ),
)

Edit: What I would really like is to have a maximum number of samples per epoch. So I have no problem keeping all samples in the dataset and randomly selecting 1000 of them each epoch.

Edit2: I came up with the following:

import torch
from torch.utils.data import DataLoader, Sampler


class RandomSampler(Sampler):
    """Samples num_samples elements randomly, without replacement."""

    def __init__(self, data_source, num_samples=None):
        self.data_source = data_source
        self._num_samples = num_samples

        if not isinstance(self.num_samples, int) or self.num_samples <= 0:
            raise ValueError(
                "num_samples should be a positive integer "
                "value, but got num_samples={}".format(self.num_samples)
            )

    @property
    def num_samples(self):
        # dataset size might change at runtime
        if self._num_samples is None:
            return len(self.data_source)
        return self._num_samples

    def __iter__(self):
        # a fresh permutation is drawn each time the loader is iterated,
        # so every epoch yields a new subset without replacement
        n = len(self.data_source)
        return iter(torch.randperm(n, dtype=torch.int64)[: self.num_samples].tolist())

    def __len__(self):
        return self.num_samples

I edited the default RandomSampler in order to be able to sample a fixed number of elements without replacement, but I don't know if this is the correct solution.
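
For completeness, this is how I plug it into the DataLoader (same dataset and settings as my first snippet). Since __iter__ is called again at the start of each epoch, my understanding is that a new subset should be drawn every time without recreating the loader:

train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    num_workers=1,
    pin_memory=True,
    drop_last=True,
    sampler=RandomSampler(train_dataset, num_samples=1000),
)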


I think your first approach should also work, as SubsetRandomSampler doesn't use replacement. Or did you see any issues using it?

But SubsetRandomSampler uses the same subset of samples for all epochs, and what I would like is a new random sample of the whole dataset for every epoch.

Ah yeah, sorry for not mentioning it, but you could recreate the DataLoader with a new sampler in each epoch, which should be cheap if you are lazily loading the data.
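
Something like this sketch (assuming train_dataset and the loader settings from your snippet, and some num_epochs):

import torch
from torch.utils.data import DataLoader, SubsetRandomSampler

for epoch in range(num_epochs):
    # randperm draws without replacement, so the 1000 indices are unique
    indices = torch.randperm(len(train_dataset))[:1000]
    train_loader = DataLoader(
        train_dataset,
        batch_size=32,
        sampler=SubsetRandomSampler(indices),
        num_workers=1,
        pin_memory=True,
        drop_last=True,
    )
    for batch in train_loader:
        ...  # training step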

Yes, I thought of doing that, but I wanted to create a DataLoader that could do it without being recreated. What do you think of Edit2 in the first post? Would it do the trick?
I tried it, and I think it is working correctly, but I would like to be sure that it is genuinely random.

Your approach looks correct. To verify it, I would suggest printing the index in Dataset.__getitem__ for a couple of epochs and making sure that you are seeing a variety of indices.
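
For example, with a small dummy dataset (assuming your RandomSampler class from the first post is in scope):

import torch
from torch.utils.data import Dataset, DataLoader

class VerboseDataset(Dataset):
    def __init__(self, length=100):
        self.data = torch.arange(length)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        print("index:", index)  # shows which samples are drawn
        return self.data[index]

dataset = VerboseDataset()
loader = DataLoader(
    dataset,
    batch_size=5,
    sampler=RandomSampler(dataset, num_samples=10),
)

for epoch in range(3):
    print("epoch", epoch)
    for batch in loader:
        pass  # indices are printed inside __getitem__

If the printed indices differ between epochs (and don't repeat within one epoch), the sampling works as intended.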


One can also just break after a fixed number of iterations per epoch with shuffle on.
This way one gets a random subsample of the whole dataset per epoch. For example, if you originally have 1000 iterations per epoch, you can set the break after 500 and train for twice the number of original epochs.

for epoch in range(num_epochs * 2):
    runs = 0
    for item in tqdm(dataloader):  # dataloader built with shuffle=True
        runs = runs + 1
        if runs > 500:
            break