Dataloader shuffling best practice

ArthurV · September 8, 2021, 12:28pm

Hello,

I am fairly new to deep learning and have a question about how to properly shuffle my training data. The tutorial data_loading_tutorial.html#dataset-class describes how to create a custom dataset, and the __getitem__() method seems to be the key part of that class. My question is this:

Should a call to __getitem__() for a given idx return the same tensor every time? Or, in other words: should batch 0 be the same across epochs? Is the shuffle=True parameter in the dataloader sufficient to ensure a complete random distribution of the data? Does it depend on the optimizer or something else (I use Adam’s optimizer in a classification context)? In fact, the tutorial doesn’t seem to rearrange the contents of the batch at each time.

I am sorry if this is not the right place to post my question; if so, please point me to a more appropriate one.

Thank you,

zetyquickly · September 8, 2021, 3:43pm

Hello @ArthurV

This is the right place to ask.

Is the shuffle=True parameter in the dataloader sufficient to ensure a complete random distribution of the data?

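With this parameter, the DataLoader draws a new random permutation of the dataset’s indices at the start of every epoch. Each iteration then yields the next batch_size samples from that permutation, so every sample is seen exactly once per epoch, in a freshly shuffled order.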
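To make this concrete, here is a minimal sketch, assuming a toy dataset of ten integers (hypothetical data, not from the tutorial). The seeded generator is only there to make the run reproducible; it is not required for shuffling.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset of 10 samples whose values equal their indices (hypothetical data).
dataset = TensorDataset(torch.arange(10))

# A seeded generator is used here only to make the sketch reproducible.
g = torch.Generator().manual_seed(0)
loader = DataLoader(dataset, batch_size=4, shuffle=True, generator=g)

epochs = []
for epoch in range(3):
    # Each pass over the loader is one epoch and draws a new permutation.
    seen = [int(x) for (batch,) in loader for x in batch]
    epochs.append(seen)
    print(f"epoch {epoch}: {seen}")
```

Each printed list should contain every value from 0 to 9 exactly once, just in a different order per epoch.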

Should a call to __getitem__() for a given idx return the same tensor every time?

The __getitem__ method is expected to return the data that is available in the dataset. You can definitely introduce some randomization into it, i.e. each call to ds[idx] may return different data. For example, if you apply randomly parameterized augmentations (transformations), you will get a different tensor each time you call ds[idx]. But in my opinion there is no particular need to return ds[10] when you ask for ds[139], since the shuffle parameter of DataLoader already exists for that purpose.
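As an illustration of randomization inside __getitem__, here is a rough sketch. The class name NoisyDataset, the noise_std parameter, and the additive Gaussian noise are all assumptions made up for this example; any random transform would play the same role.

```python
import torch
from torch.utils.data import Dataset

class NoisyDataset(Dataset):
    """Hypothetical dataset: ds[idx] always refers to the same underlying
    sample, but a random augmentation (additive noise) is applied per access."""

    def __init__(self, data, noise_std=0.1):
        self.data = data
        self.noise_std = noise_std

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        # Fresh noise on every call, so repeated ds[idx] calls differ.
        return sample + self.noise_std * torch.randn_like(sample)

ds = NoisyDataset(torch.zeros(5, 3))
a, b = ds[2], ds[2]
print(torch.equal(a, b))  # the two tensors come from the same idx
```

Note that idx still maps to the same underlying sample; only the augmentation is random, and the ordering of samples is left to the DataLoader.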

Does it depend on the optimizer or something else (I use Adam’s optimizer in a classification context)?

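Adam is a variant of stochastic gradient descent with adaptive per-parameter step sizes. In PyTorch, optimizers such as SGD and Adam are typically applied to one mini-batch at a time, which is known as mini-batch gradient descent. Shuffling changes which samples end up in each mini-batch from epoch to epoch; this introduces randomness (the stochastic part) and generally helps the loss converge, so shuffling is useful whichever of these optimizers you use.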

In fact, the tutorial doesn’t seem to rearrange the contents of the batch at each time.

Unfortunately, I haven’t tried the tutorial, but as long as the DataLoader is created with shuffle=True, it will put different samples into each batch every epoch, i.e. if you inspect the first batch of each epoch, it will most likely contain a different set of samples from the dataset.
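If you want to check this yourself, something along these lines should work. It is only a sketch on a toy dataset of 100 integers (hypothetical data), and first_batches is a helper name invented for this example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(100))  # hypothetical toy data

def first_batches(shuffle, epochs=3):
    """Return the first batch of each epoch as a sorted list of sample values."""
    loader = DataLoader(dataset, batch_size=8, shuffle=shuffle)
    result = []
    for _ in range(epochs):
        (batch,) = next(iter(loader))  # iter(loader) starts a new epoch
        result.append(sorted(batch.tolist()))
    return result

fixed = first_batches(shuffle=False)
shuffled = first_batches(shuffle=True)
print("shuffle=False:", fixed)
print("shuffle=True: ", shuffled)
```

With shuffle=False the first batch is always samples 0 to 7; with shuffle=True the three first batches should almost always differ from one another.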

Hope it helps!

ArthurV · September 9, 2021, 6:48am

Thank you for your quick answer. I note that it is not necessary to re-shuffle my dataset myself at each epoch as long as the dataloader has its shuffle parameter activated.