How are the images shuffled within the batches?

Hi all,

I was wondering how the DataLoader handles the shuffling of images.

Scenario
Let’s say my batch size is 64 and I have a dataset of 64,000 training images; my training_dataloader will therefore create 1,000 batches, each of which contains 64 images.

My understanding is that, when selecting which batch comes next in the training loop, a random index is picked among all the batches. Is this right?

Question
Does each of those 1,000 batches always contain the same 64 images, or does every batch sample 64 images randomly each time?

Thank you very much in advance

I think the DataLoader just shuffles the complete dataset (i.e., creates an index array over the range [0, num_examples) and shuffles it) at each epoch. Then it sweeps over these indices in sequential order. E.g., if you have a dataset like [0, 1, 2, 3, 4, 5, 6] with batch size 2, a random order could be [4, 2, 1, 6, 0, 3, 5], and the minibatches would then be
[4, 2], [1, 6], [0, 3], [5].

This way, you would have random sampling WITHOUT replacement.
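
Just to illustrate that idea (this is not the actual DataLoader implementation, only a small sketch using torch.randperm):

import torch

num_examples = 7
batch_size = 2

# new random permutation of all indices at the start of each epoch
indices = torch.randperm(num_examples)  # e.g. tensor([4, 2, 1, 6, 0, 3, 5])

# sweep over the shuffled indices in sequential chunks of batch_size
minibatches = [indices[i:i + batch_size] for i in range(0, num_examples, batch_size)]
print(minibatches)  # e.g. [tensor([4, 2]), tensor([1, 6]), tensor([0, 3]), tensor([5])]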

My understanding is that, when selecting which batch comes next in the training loop, a random index is picked among all the batches. Is this right?

This would be random sampling WITH replacement. I am 99.9% sure that this is not what the DataLoader does (although it is also a valid approach; it is actually even more “correct” if you think of stochastic gradient descent. As far as I know, it is less common though and may not work as well empirically; with large dataset sizes, e.g., > 500k, I doubt you would notice any difference in the resulting model).

You are absolutely correct. If no sampler is passed to the DataLoader and shuffle=True is set, a RandomSampler will be used.
The replacement argument is set to False by default, so basically this line of code will be executed as you’ve described.
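
So these two loaders should behave the same way (a small sketch, using the default replacement=False):

import torch
from torch.utils.data import DataLoader, TensorDataset, RandomSampler

dataset = TensorDataset(torch.arange(10))

# shuffle=True creates a RandomSampler(replacement=False) under the hood
loader_a = DataLoader(dataset, batch_size=2, shuffle=True)

# explicit version with the same behavior
loader_b = DataLoader(dataset, batch_size=2, sampler=RandomSampler(dataset, replacement=False))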

Nice, I didn’t know that this existed!

Scenario:
The question comes from the fact that I am training a ResNet on CIFAR10: the training accuracy reaches 100% (training loss near 0) at around epoch 100, yet I still don’t reach a high enough accuracy on the test set.

I don’t think I am overfitting, since the validation loss keeps decreasing as well, so I didn’t go for increasing the weight decay.

Question:
Any idea how I could keep some room for improvement and avoid reaching 100% training accuracy so quickly? @rasbt @ptrblck

I thought about increasing the shuffling, but given your answer this is not an option any more :sweat_smile:

Thank you!

One of the many things to try is data augmentation (some random rotation & translation, for example).
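
For example, with torchvision transforms (a rough sketch; the exact transforms and magnitudes are placeholders you would want to tune):

import torchvision.transforms as transforms
from torchvision.datasets import CIFAR10

# random rotation and translation, as suggested above (values are placeholders)
train_transform = transforms.Compose([
    transforms.RandomRotation(15),                      # rotate by up to +/- 15 degrees
    transforms.RandomAffine(0, translate=(0.1, 0.1)),   # shift by up to 10% of the image size
    transforms.ToTensor(),
])

train_set = CIFAR10(root='./data', train=True, download=True, transform=train_transform)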

I am following the ResNet paper settings for CIFAR10.
I think I give many more insights into what could be happening in this new question, after looking at my validation loss.

What is the meaning of the shape?
Could that be the reason why I can’t reach more than 92% test accuracy?

Thank you for your explanation. Is there any working example of using the DataLoader WITH replacement?

I think the easiest way would be to create a Random(Weighted)Sampler and just pass it to the DataLoader. I would recommend implementing the replacement logic in the sampler, not in the Dataset directly.

Thank you for your response. I understand the logic; however, a working example would make things much easier. Do you have or have you seen any such working example?

Here is a small dummy example:

import torch
from torch.utils.data import DataLoader, TensorDataset

# dummy dataset: 10 "images" with their labels
data = torch.randn(10, 3, 224, 224)
target = torch.arange(10)

dataset = TensorDataset(data, target)

# sample indices WITH replacement, so the same image can appear more than once per epoch
sampler = torch.utils.data.sampler.RandomSampler(
    dataset,
    replacement=True
)

loader = DataLoader(
    dataset,
    batch_size=1,
    sampler=sampler
)

for data, target in loader:
    print(target)
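
If you additionally need per-sample weights, the Random(Weighted)Sampler mentioned above works in a similar way; here is a rough sketch with made-up weights (the weight values are just placeholders):

from torch.utils.data import WeightedRandomSampler

# one weight per sample; higher weight -> drawn more often (values are arbitrary here)
weights = torch.ones(len(dataset))
weights[0] = 10.  # oversample the first example

weighted_sampler = WeightedRandomSampler(
    weights,
    num_samples=len(dataset),
    replacement=True
)
weighted_loader = DataLoader(dataset, batch_size=1, sampler=weighted_sampler)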

Let me know if you need more information! :slight_smile:

Thank you so much. Yes, I believe that can solve my problem. I recently developed a similar working example, and based on your answer I believe I am on the right track!