Hello, I would like to select mini-batches in random order using DataLoader, where each mini-batch keeps holding the same items. Can I do this via DataLoader, or do I have to iterate the dataset manually? Manual iteration would work, but I still want to exploit the convenient features DataLoader provides (e.g., collate_fn, pin_memory, num_workers, …).
For example, suppose there is a dataset with item indices from 1 to n, sorted chronologically.
If the mini-batch size is k, I want the dataset divided into (1 ~ k), (k+1 ~ 2k), (2k+1 ~ 3k), …, where the order of the mini-batches is shuffled randomly every epoch, but the items inside each mini-batch never change.
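One way to get exactly this behavior while keeping all of DataLoader's features is a custom batch sampler: build the fixed, contiguous index chunks once, then shuffle only the order of the chunks each epoch and pass the sampler via DataLoader's `batch_sampler` argument. The class name `FixedChunkBatchSampler` below is my own; this is a minimal sketch, not the only way to do it.

```python
import random

class FixedChunkBatchSampler:
    """Yields fixed, contiguous index chunks (0..k-1, k..2k-1, ...)
    in a freshly shuffled order each epoch."""

    def __init__(self, dataset_len, batch_size, drop_last=False):
        # Build the chunks once; their contents never change afterwards.
        self.batches = [
            list(range(start, min(start + batch_size, dataset_len)))
            for start in range(0, dataset_len, batch_size)
        ]
        if drop_last and self.batches and len(self.batches[-1]) < batch_size:
            self.batches.pop()

    def __iter__(self):
        order = list(range(len(self.batches)))
        random.shuffle(order)  # shuffle batch order, not batch contents
        for i in order:
            yield self.batches[i]

    def __len__(self):
        return len(self.batches)

# Usage with DataLoader (batch_size, shuffle, sampler, and drop_last must
# then be left unset, since batch_sampler is mutually exclusive with them):
# loader = DataLoader(dataset,
#                     batch_sampler=FixedChunkBatchSampler(len(dataset), 8),
#                     num_workers=4, pin_memory=True)
```

This keeps `__getitem__` a plain single-item getter, so collate_fn, pin_memory, and multi-worker loading all work as usual.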
data = DataLoader(dataset, batch_size=8, num_workers=4, pin_memory=True, shuffle=True, drop_last=False)
@Minseok For instance, consider the DataLoader above: to enable shuffling you need to pass shuffle=True. num_workers, batch_size, and pin_memory depend on your hardware; an optimal choice of these parameters reduces overhead and moves data from CPU to GPU faster.
Thanks for your answer, but DataLoader parameters alone won't keep the items within a batch fixed.
Instead, I came up with an idea:
- receive a 'random' mini-batch index from a DataLoader with batch_size=1
- the __getitem__ function in my custom Dataset class returns k consecutive items (one mini-batch of size k) per call.
def __getitem__(self, index):
    batch_item = torch.zeros(self.batch_size, self.data_size)
    for i in range(self.batch_size):
        # custom single-data getter function for the given global index
        batch_item[i] = self._get_data_function(index * self.batch_size + i)
    return batch_item
However, I’m worried that the for loop might hurt speed.
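If the underlying data already lives in one contiguous array or tensor, the per-item loop can be avoided entirely by returning a slice. This is a hedged sketch under that assumption (the attribute name `self.data` is hypothetical, standing in for whatever backs `_get_data_function`); slicing returns a view without a Python-level loop:

```python
class ChunkDataset:
    """Sketch: __getitem__ returns one whole mini-batch as a slice,
    assuming the data is a single indexable container (list, ndarray,
    or torch tensor of shape [n, data_size])."""

    def __init__(self, data, batch_size):
        self.data = data            # hypothetical: all items in one container
        self.batch_size = batch_size

    def __len__(self):
        # number of mini-batches, counting a possibly shorter last one
        return (len(self.data) + self.batch_size - 1) // self.batch_size

    def __getitem__(self, index):
        start = index * self.batch_size
        # one slice replaces the for loop; for a torch tensor this is a view
        return self.data[start:start + self.batch_size]
```

With batch_size=1 in the outer DataLoader, each call hands back one pre-formed mini-batch, and the slice cost is negligible compared with a per-item loop.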
Strange. Are you using a custom dataset written by you, or a prebuilt one?
If you are using a custom dataset, can you verify in the
__getitem__ method that you are handling the indices correctly?
Would you be able to provide a toy example to reproduce the issue?