Dataloader shuffling best practice

Hello @ArthurV

This is the right place to find answers.

Is the shuffle=True parameter in the dataloader sufficient to ensure a complete random distribution of the data?

This parameter makes the DataLoader permute the dataset's indices at the start of every epoch. That means that within each epoch, every iteration draws a random batch of batch_size samples.
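As a minimal sketch of this behaviour (the toy TensorDataset and batch size are just assumptions for illustration), you can print the batches of two epochs and see that the order differs:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset: 10 samples, each a single scalar feature.
ds = TensorDataset(torch.arange(10).float().unsqueeze(1))

# shuffle=True makes the DataLoader permute the dataset indices at the
# start of every epoch, so each iteration yields a random batch.
loader = DataLoader(ds, batch_size=4, shuffle=True)

for epoch in range(2):
    for (batch,) in loader:
        print(epoch, batch.squeeze(1).tolist())
```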

Should the call to __getitem__() for a given idx return the same tensor every time?

The __getitem__ method is expected to return the data available in the dataset. You can definitely introduce some randomization into it, i.e. each time you call ds[idx] it may return different data. For example, if you apply randomly parameterized augmentations (transformations), you'll get a different tensor each time you call ds[idx]. But in my opinion there's no particular need to return ds[10] when you ask for ds[139], since the shuffle parameter in the DataLoader already exists for that. A short sketch of a dataset with random augmentation follows below.
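Here is one way such a dataset could look; the NoisyDataset class and the additive-noise "augmentation" are hypothetical, just to show that ds[idx] can legitimately return a different tensor on every call:

```python
import torch
from torch.utils.data import Dataset

class NoisyDataset(Dataset):
    """Hypothetical dataset whose __getitem__ applies a random augmentation,
    so ds[idx] can return a different tensor on every call."""

    def __init__(self, data: torch.Tensor):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        # Randomly parameterized "augmentation": additive Gaussian noise.
        return sample + 0.1 * torch.randn_like(sample)

ds = NoisyDataset(torch.ones(5, 3))
print(ds[2])  # different tensor on each call because of the random noise
print(ds[2])
```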

Does it depend on the optimizer or something else (I use the Adam optimizer in a classification context)?

Adam is a variant of gradient descent. In PyTorch, SGD performs what is known as mini-batch gradient descent. So when you shuffle, you introduce randomness (making it stochastic), which helps the loss converge.
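To put the pieces together, a minimal classification training loop might look like the sketch below; the linear model, the toy random data, and the hyperparameters (batch size, learning rate, number of epochs) are all assumptions for illustration:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy classification problem: 100 samples, 4 features, 3 classes.
X = torch.randn(100, 4)
y = torch.randint(0, 3, (100,))
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

model = nn.Linear(4, 3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):
    for xb, yb in loader:  # mini-batches arrive in a fresh random order each epoch
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
```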

In fact, the tutorial doesn't seem to rearrange the contents of the batch each time.

Unfortunately, I haven't tried the tutorial, but as long as the DataLoader has shuffle=True, it will output different samples per batch each epoch, i.e. if you inspect the first batch of each epoch, it will probably contain a different set of objects from the dataset.
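You can check this directly; in this quick sketch (toy dataset and batch size are assumptions), the first batch of each epoch is printed so you can compare them:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(20))
loader = DataLoader(ds, batch_size=5, shuffle=True)

for epoch in range(3):
    first_batch = next(iter(loader))[0]
    print(f"epoch {epoch}: first batch = {first_batch.tolist()}")

# With shuffle=True the printed values differ between epochs;
# with shuffle=False they would be identical every epoch.
```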

Hope it helps!