Random batch sampling from DataLoader

Hello, I’m interested in whether it’s possible to randomly sample X batches of data from a DataLoader object for each epoch. Is there any way of accessing the batches by index? Or something similar to achieve such behavior? Thank you for the help.


I guess what you’re asking is the following:

  • If you set shuffle=True in the DataLoader, you get random samples in your batches. I mean the batches will be created from random samples of the dataset instead of sequential ones.
  • For accessing the batches with indices, try for batch, (data, target) in enumerate(loader), so the batch variable will hold the index of the batch (I think it works that way). There is a short sketch below this list.
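A minimal sketch of that (assuming dataset is a map-style Dataset returning (sample, target) pairs and a batch size of 32):

import torch

loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
for batch, (data, target) in enumerate(loader):
    # batch is the running batch index; the samples inside each batch are drawn in shuffled order
    print(batch)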

Thank you for your reply. I already tried this solution; the problem is that for each epoch (training loop) I want to select a small subset of batches (let’s say 5 batches). If I simply enumerate through the loader (enumerate(loader)), I will iterate over all batches in the (shuffled) order the DataLoader produces them, not just a small random subset.

I am interested in whether something like this is possible (or some other way to achieve the same behaviour):

import numpy as np
import torch

data_loader = torch.utils.data.DataLoader(data, shuffle=True)
for epoch in range(100):
    random_ids = np.random.randint(len(data_loader), size=5)
    batches = data_loader[random_ids]  # desired behaviour; a DataLoader does not support indexing

Sorry then! I don’t know how to help you with that :grimacing:

No worries, I’ve been struggling with this for a while.

If you want to sample only a specific subset using predefined indices, you could create a new DataLoader with a SubsetRandomSampler or wrap the Dataset into a Subset.
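A minimal sketch of both options (assuming dataset is a map-style Dataset and a batch size of 32, so 5 batches correspond to 5 * 32 samples):

import torch
from torch.utils.data import DataLoader, Subset, SubsetRandomSampler

indices = torch.randperm(len(dataset))[:5 * 32]  # fresh random indices, e.g. drawn once per epoch

# Option 1: keep the full Dataset and pass a SubsetRandomSampler
loader = DataLoader(dataset, batch_size=32, sampler=SubsetRandomSampler(indices))

# Option 2: wrap the Dataset into a Subset and shuffle it
subset_loader = DataLoader(Subset(dataset, indices.tolist()), batch_size=32, shuffle=True)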


Hi,

I did this

num_test_examples = len(test_dl)
indices = torch.randperm(num_test_examples)[:100]
random_test_dataset = SubsetRandomSampler(test_dl, indices)

random_test_dataloader = DataLoader(random_test_dataset, batch_size=1, shuffle=False)
for i,data in random_test_dataloader:
    print(i)

Then I get this error:

TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_84972/1562982579.py in <module>
----> 1 for i,data in random_test_dataloader:
      2     print(i)

~/anaconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py in __next__(self)
    626                 # TODO(https://github.com/pytorch/pytorch/issues/76750)
    627                 self._reset()  # type: ignore[call-arg]
--> 628             data = self._next_data()
    629             self._num_yielded += 1
    630             if self._dataset_kind == _DatasetKind.Iterable and \

~/anaconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    669     def _next_data(self):
    670         index = self._next_index()  # may raise StopIteration
--> 671         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    672         if self._pin_memory:
    673             data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)

~/anaconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     56                 data = self.dataset.__getitems__(possibly_batched_index)
     57             else:
---> 58                 data = [self.dataset[idx] for idx in possibly_batched_index]
     59         else:
...
---> 58                 data = [self.dataset[idx] for idx in possibly_batched_index]
     59         else:
     60             data = self.dataset[possibly_batched_index]

TypeError: 'SubsetRandomSampler' object is not subscriptable

What should I do here?

You have to pass the SubsetRandomSampler as the sampler argument of the DataLoader, not as the dataset.
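A minimal sketch of the corrected setup (assuming test_dataset is the underlying map-style Dataset that test_dl was built from; note that SubsetRandomSampler takes only the indices, not a DataLoader):

import torch
from torch.utils.data import DataLoader, SubsetRandomSampler

indices = torch.randperm(len(test_dataset))[:100]
sampler = SubsetRandomSampler(indices)  # only the indices go in here

# The Dataset is the first argument; the sampler goes into `sampler`
# (shuffle must stay False/unset when a sampler is provided).
random_test_dataloader = DataLoader(test_dataset, batch_size=1, sampler=sampler)

for i, data in enumerate(random_test_dataloader):
    print(i)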

Like this?

random_test_dataloader = DataLoader(test_dl, batch_size=1, shuffle=False, sampler=random_test_dataset)