Random batch sampling from DataLoader

Hello, I’m interested in whether it’s possible to randomly sample X batches of data from a DataLoader object for each epoch. Is there any way of accessing the batches by index? Or something similar to achieve such behavior? Thank you for the help.


I guess what you’re asking is the following:

  • If you set shuffle=True in the DataLoader, you get random samples in your batches. I mean the batches will be created from random samples of the dataset instead of sequential ones.
  • For accessing the batches with indices, try for batch, (data, target) in enumerate(loader), so the batch variable will hold the index of the batch (I think it works that way). There is a short sketch below this list.
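A minimal sketch of that (assuming dataset is a map-style Dataset returning (sample, target) pairs and a batch size of 32):

import torch

loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
for batch, (data, target) in enumerate(loader):
    # batch is the running batch index; the samples inside each batch are drawn in shuffled order
    print(batch)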

Thank you for your reply. I already tried this solution; the problem is that for each epoch (training loop) I want to select a small subset of batches (let’s say 5 batches). If I simply enumerate through the loader (enumerate(loader)), I will iterate over all batches in the (shuffled) order the DataLoader produces them, not just a small random subset.

I am interested in whether something like this is possible (or some other way to achieve the same behaviour):

import numpy as np
import torch

data_loader = torch.utils.data.DataLoader(data, shuffle=True)
for epoch in range(100):
    random_ids = np.random.randint(len(data_loader), size=5)
    batches = data_loader[random_ids]  # desired behaviour; a DataLoader does not support indexing

Sorry then! I don’t know how to help you with that :grimacing:

No worries, I’ve been struggling with this for a while.

If you want to sample only a specific subset using predefined indices, you could create a new DataLoader with a SubsetRandomSampler or wrap the Dataset into a Subset.
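A minimal sketch of both options (assuming dataset is a map-style Dataset and a batch size of 32, so 5 batches correspond to 5 * 32 samples):

import torch
from torch.utils.data import DataLoader, Subset, SubsetRandomSampler

indices = torch.randperm(len(dataset))[:5 * 32]  # fresh random indices, e.g. drawn once per epoch

# Option 1: keep the full Dataset and pass a SubsetRandomSampler
loader = DataLoader(dataset, batch_size=32, sampler=SubsetRandomSampler(indices))

# Option 2: wrap the Dataset into a Subset and shuffle it
subset_loader = DataLoader(Subset(dataset, indices.tolist()), batch_size=32, shuffle=True)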


Hi,

I did this

num_test_examples = len(test_dl)
indices = torch.randperm(num_test_examples)[:100]
random_test_dataset = SubsetRandomSampler(test_dl, indices)

random_test_dataloader = DataLoader(random_test_dataset, batch_size=1, shuffle=False)
for i,data in random_test_dataloader:
    print(i)

Then I get this error:

TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_84972/1562982579.py in <module>
----> 1 for i,data in random_test_dataloader:
      2     print(i)

~/anaconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py in __next__(self)
    626                 # TODO(https://github.com/pytorch/pytorch/issues/76750)
    627                 self._reset()  # type: ignore[call-arg]
--> 628             data = self._next_data()
    629             self._num_yielded += 1
    630             if self._dataset_kind == _DatasetKind.Iterable and \

~/anaconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    669     def _next_data(self):
    670         index = self._next_index()  # may raise StopIteration
--> 671         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    672         if self._pin_memory:
    673             data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)

~/anaconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     56                 data = self.dataset.__getitems__(possibly_batched_index)
     57             else:
---> 58                 data = [self.dataset[idx] for idx in possibly_batched_index]
     59         else:
...
---> 58                 data = [self.dataset[idx] for idx in possibly_batched_index]
     59         else:
     60             data = self.dataset[possibly_batched_index]

TypeError: 'SubsetRandomSampler' object is not subscriptable

What should I do here?

You have to pass the SubsetRandomSampler as the sampler argument of the DataLoader, not as the dataset.
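A minimal sketch of the corrected setup (assuming test_dataset is the underlying map-style Dataset that test_dl was built from; note that SubsetRandomSampler takes only the indices, not a DataLoader):

import torch
from torch.utils.data import DataLoader, SubsetRandomSampler

indices = torch.randperm(len(test_dataset))[:100]
sampler = SubsetRandomSampler(indices)  # only the indices go in here

# The Dataset is the first argument; the sampler goes into `sampler`
# (shuffle must stay False/unset when a sampler is provided).
random_test_dataloader = DataLoader(test_dataset, batch_size=1, sampler=sampler)

for i, data in enumerate(random_test_dataloader):
    print(i)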

Like this?

random_test_dataloader = DataLoader(test_dl, batch_size=1, shuffle=False, sampler=random_test_dataset)