How does shuffle in data loader work?

Well, I just want to ask how PyTorch shuffles the dataset. This is probably a very silly question.

I mean, I set shuffle to True in the DataLoader, and I just wonder how this influences the dataset. For example, I put the whole MNIST dataset, which has 60000 samples, into the DataLoader and set shuffle to True. Is it possible that, if I only use 30000 of them to train the model, the model cannot identify the digits 6 to 9, because the shuffle put all the samples of 6 to 9 into the last 30000 which I did not use?

In other words, is it possible that the shuffle operation leads to a model trained on only part of the training set never seeing all the features of the whole dataset?


If you set shuffle=True, internally the RandomSampler will be used, which just permutes the indices of all samples as seen here.

Are you slicing the data before passing it to the DataLoader?
If not, all samples should be used regardless of whether shuffle is active or not.
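
As a quick sanity check (a minimal sketch with a toy dataset, not your actual code), you can verify that one full pass over a shuffled DataLoader still visits every sample exactly once:

import torch
from torch.utils.data import DataLoader, TensorDataset

# toy dataset with 60000 samples; the targets are just the indices
data = torch.arange(60000).float().unsqueeze(1)
targets = torch.arange(60000)
dataset = TensorDataset(data, targets)

loader = DataLoader(dataset, batch_size=64, shuffle=True)

seen = torch.cat([target for _, target in loader])
print(seen.shape)             # torch.Size([60000])
print(seen.unique().numel())  # 60000 -> each sample appears exactly once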

Thank you so much for answering this question.

Well, I loaded the whole dataset, which has 60000 samples, into the DataLoader with shuffle=True. But when I train the model, I only use about 6400 of those samples. I trained the model for 100 epochs and each epoch had 64 images.

So I am just wondering: do those 6400 samples contain all the digits? I mean, is it possible that I was very lucky and only got 0 to 5, so the model could not identify the rest of the digits when it was tested?:joy::joy::joy::joy: If this happened, the accuracy should be lower…

But the interesting thing is that my test accuracy is pretty normal. I trained with 6400 samples, tested on the test set which has 10000 samples, and I still get 70% ~ 80% accuracy.

How did you use this subset of 6400 samples?
Did you use a Subset or a SubsetRandomSampler?
Depending on the approach, you might have selected some particular classes and left out others.
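
For example (a rough sketch, assuming the standard torchvision MNIST training set), hand-picking indices by label would leave out classes, whereas a SubsetRandomSampler fed a random permutation would not:

import torch
from torch.utils.data import DataLoader, Subset, SubsetRandomSampler
from torchvision import datasets, transforms

dataset = datasets.MNIST(root='./data', train=True, download=True,
                         transform=transforms.ToTensor())

# biased selection: only indices whose label is < 6, so digits 6-9 are never seen
biased_indices = (dataset.targets < 6).nonzero(as_tuple=True)[0][:6400].tolist()
biased_loader = DataLoader(Subset(dataset, biased_indices), batch_size=64)

# unbiased selection: 6400 indices taken from a random permutation of all samples
random_indices = torch.randperm(len(dataset))[:6400].tolist()
random_loader = DataLoader(dataset, batch_size=64,
                           sampler=SubsetRandomSampler(random_indices))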

However, if you’ve used a break statement inside the training loop over a DataLoader that was supposed to use all the data, it would be unlikely to end up with just a subset of the classes.

Well, this is my code:
def train_model(model, linear, input_CNN, learning_rate, nsamples, load_train_dataset):
    epoch = 0
    print_loss = 0
    criterion = nn.CrossEntropyLoss()
    optimizer_module = optim.SGD(model.parameters(), lr=learning_rate)
    optimizer_linear = optim.SGD(linear.parameters(), lr=learning_rate)
    optimizer_input_CNN = optim.SGD(input_CNN.parameters(), lr=learning_rate)

    for i, (data, target) in enumerate(load_train_dataset()):
        if i > nsamples:
            break
        else:
            # train network

The parameter nsamples is 100, which means training for 100 epochs. I don’t know if this code is right or not. I loaded the whole dataset into the data loader but use nsamples to control the size of the training set.

This code might work and should create random batches, if shuffle=True.
If you don’t want to use all samples, you could of course use a Subset and avoid using the break statement.
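
Something like this should work (a minimal sketch, assuming the torchvision MNIST training set and a batch size of 64, so 6400 samples give exactly 100 batches):

import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

dataset = datasets.MNIST(root='./data', train=True, download=True,
                         transform=transforms.ToTensor())

# pick 6400 random indices once and wrap them in a Subset
indices = torch.randperm(len(dataset))[:6400].tolist()
train_subset = Subset(dataset, indices)

loader = DataLoader(train_subset, batch_size=64, shuffle=True)

for data, target in loader:  # exactly 100 batches, no break needed
    pass                     # train the networks here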

Alright, I will try to use a Subset instead of this.
But could you please help me with one more thing? Could the way I did it make the model learn incomplete features of the training dataset?

Could you explain a bit what you mean by “incomplete features of the training data set”?

OK…:joy::joy::joy::joy:
My fault, I mean I used the MNIST dataset. Is it possible that my method let the model learn 0 to 5 but miss 6 to 9, so that it can only identify 0 to 5 and shows a really poor result?

It’s unlikely that your model has only seen samples from classes 0 to 5, if you’ve shuffled the dataset.
Even though you are only using approx. 10% of the dataset, the batches should be well shuffled.

Here is a small example:

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

dataset = datasets.MNIST(root='./data',
                         transform=transforms.ToTensor())
loader = DataLoader(dataset, batch_size=64, shuffle=True)

# collect the targets of the first 100 batches (100 * 64 = 6400 samples)
targets = []
for idx, (data, target) in enumerate(loader):
    targets.append(target)
    if idx >= 99:
        break

targets = torch.cat(targets)
print(targets.shape)
> torch.Size([6400])
print(targets.unique(return_counts=True))
>(tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), tensor([634, 666, 657, 666, 606, 579, 629, 684, 637, 642]))

As you can see, each class has approx. the same number of samples, which should avoid overfitting to a subset of the classes.

The overfitting might be caused by another part of the code. Are you manipulating the data in some other way?


Ah ha… This is what I need. Thank you so much.
Well, my results are pretty good. I mean there is no overfitting or anything like that. Because there is no overfitting, I was just wondering whether those assumptions were valid or not…:rofl::rofl::rofl::rofl:

But thank you so much for answering those questions. Right now I don’t need to change my code. Thank you so much.:grin::grin::grin::grin:


Hi Ptrblck,

Just wondering, to be sure: is the random sampling that shuffling does in the DataLoader uniform? Does it uniformly pick the indices of the data?

If shuffle=True, the DataLoader will use a RandomSampler as seen here, which uses torch.randperm in the default setup (replacement=False) and randomly permutes the sample indices. So the indices are not drawn independently with replacement; instead the whole index set is randomly permuted, and each sample is visited exactly once per epoch.
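
As a small illustration (a minimal sketch of the behaviour, not the DataLoader internals verbatim), torch.randperm returns a random permutation, so every index appears exactly once:

import torch

num_samples = 10
perm = torch.randperm(num_samples)
print(perm)                   # e.g. tensor([3, 7, 0, 9, 2, 5, 8, 1, 6, 4])
print(perm.unique().numel())  # 10 -> each index is drawn exactly once (no replacement)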
