Well, I just want to ask how PyTorch shuffles the dataset. This is probably a very silly question.
I mean, I set shuffle=True in the DataLoader, and I wonder how this affects the dataset. For example, I put the whole MNIST dataset, which has 60000 samples, into the DataLoader with shuffle=True. If I only use 30000 samples to train the model, is it possible that the model cannot identify the digits 6 to 9 because the shuffle happened to put all samples of those digits into the last 30000, which I did not use?
In other words, is it possible that the shuffle operation leads to a model trained on a partial training set missing entire classes of the whole dataset?
If you set shuffle=True, internally the RandomSampler will be used, which just permutes the indices of all samples as seen here.
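In case it helps to see the mechanics, here is a simplified stdlib-only sketch of what that sampler does. The real implementation lives in torch.utils.data and uses torch.randperm; the class name below is made up for illustration:

```python
import random

class RandomSamplerSketch:
    """Simplified stand-in for torch.utils.data.RandomSampler with
    replacement=False (the default used when shuffle=True)."""
    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        # The real sampler does torch.randperm(len(self.data_source));
        # random.shuffle produces the same kind of permutation.
        indices = list(range(len(self.data_source)))
        random.shuffle(indices)
        return iter(indices)

    def __len__(self):
        return len(self.data_source)

sampler = RandomSamplerSketch(list(range(10)))
indices = list(sampler)
print(sorted(indices))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] -- every index exactly once
```

The key point is that a permutation reorders the indices; it never drops or duplicates any of them.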
Are you slicing the data before passing it to the DataLoader?
If not, all samples should be used regardless of whether shuffle is active or not.
Thank you so much for answering this question.
Well, I loaded the whole dataset, which has 60000 samples, into the DataLoader with shuffle=True. But when I train the model, I only use about 6400 of those samples: I trained for 100 epochs and each epoch has 64 images.
So I am just wondering: do those 6400 samples contain all the digits? I mean, is it possible that I was very unlucky and only got digits 0 to 5, so the model could not identify the rest of the digits when it was tested? If this happened, the accuracy should be lower…
But the interesting thing is that my test accuracy is pretty normal. I trained with 6400 samples, tested on the test dataset of 10000 samples, and still got 70% ~ 80% accuracy.
How did you use this subset of 6400 samples?
Did you use a Subset?
Depending on the approach, you might have selected some particular classes and left out others.
However, if you’ve used a break statement inside the training loop for the DataLoader (which was supposed to use all data), it would be unlikely to get just a subset of classes.
Well, this is my code:
```python
def train_model(model, linear, input_CNN, learning_rate, nsamples, load_train_dataset):
    epoch = 0
    print_loss = 0
    criterion = nn.CrossEntropyLoss()
    optimizer_module = optim.SGD(model.parameters(), lr=learning_rate)
    optimizer_linear = optim.SGD(linear.parameters(), lr=learning_rate)
    optimizer_input_CNN = optim.SGD(input_CNN.parameters(), lr=learning_rate)
    for i, (data, target) in enumerate(load_train_dataset()):
        if i > nsamples:
            break
        # train network
        ...
```
The parameter nsamples is 100, which means training for 100 iterations (what I called epochs above). I don’t know if this code is right or not. I loaded the whole dataset into the data loader but use this nsamples to control the size of the training set.
This code might work and should create random batches, if shuffle=True is set in the DataLoader.
If you don’t want to use all samples, you could of course use a Subset and avoid using the break statement.
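To show why this works: Subset is conceptually just an index remapper around the original dataset. The class below is a simplified stand-in for illustration, not the actual torch.utils.data.Subset source:

```python
class SubsetSketch:
    """Simplified sketch of the idea behind torch.utils.data.Subset:
    it wraps a dataset and remaps indices, nothing more."""
    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = list(indices)

    def __getitem__(self, idx):
        # Index into the subset -> index into the original dataset
        return self.dataset[self.indices[idx]]

    def __len__(self):
        return len(self.indices)

data = [(i, i % 10) for i in range(100)]      # toy (sample, label) pairs
subset = SubsetSketch(data, range(0, 100, 2))  # keep every second sample
print(len(subset))  # 50
print(subset[1])    # (2, 2): subset index 1 maps to original index 2
```

With the real class, something like `Subset(dataset, torch.randperm(len(dataset))[:6400].tolist())` would give a fixed random 6400-sample subset that can then be passed to the DataLoader, with no break statement needed.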
Alright, I will try to use Subset rather than this.
But could you please help me with one more thing? Could the way I did this make the model learn incomplete features of the training dataset?
Could you explain a bit, what you mean by “incomplete features of the training data set”?
It’s unlikely that your model has only seen samples from classes 0 to 5, if you’ve shuffled the dataset.
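To put a rough number on “unlikely”: assuming for simplicity that each digit has exactly 6000 of the 60000 training samples (the real MNIST class counts vary slightly), the hypergeometric probability that a random 6400-sample draw misses one particular digit entirely can be computed in log space:

```python
from math import lgamma

def log_comb(n, k):
    # log of the binomial coefficient C(n, k), computed via lgamma
    # to avoid overflow for large n
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

# P(a random 6400-sample draw from 60000 contains none of the
# 6000 samples of one particular digit) = C(54000, 6400) / C(60000, 6400)
log_p = log_comb(54000, 6400) - log_comb(60000, 6400)
print(log_p)  # around -700, i.e. the probability is effectively zero
```

exp(-700) is far beyond anything that could happen in practice, so a shuffled subset of this size essentially always contains all ten digits.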
Even though you are only using approx. 10% of the dataset, the batches should be well shuffled.
Here is a small example:
```python
dataset = datasets.MNIST(root='./data',
                         transform=transforms.ToTensor())
loader = DataLoader(dataset, batch_size=64, shuffle=True)

targets = []
for idx, (data, target) in enumerate(loader):
    targets.append(target)
    if idx >= 99:  # stop after 100 batches (6400 samples)
        break

targets = torch.cat(targets)
print(targets.unique(return_counts=True))
```
> (tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), tensor([634, 666, 657, 666, 606, 579, 629, 684, 637, 642]))
As you can see, each class has approx. the same number of samples, which should avoid overfitting to a subset of the classes.
The overfitting might be a result of another code part. Are you manipulating the data in some other way?
Just to be sure: is the random sampling that shuffling does in the DataLoader uniform? Does it uniformly pick the indices of the data?
The DataLoader will use a RandomSampler as seen here, which uses torch.randperm in the default setup (replacement=False) and randomly permutes the sample indices. It is thus not uniform sampling with replacement, but a permutation of the indices: every sample is drawn exactly once per pass.
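The distinction can be made concrete with a stdlib-only sketch, where random.sample stands in for torch.randperm and random.randrange for sampling with replacement:

```python
import random

random.seed(0)
n = 10

# A permutation (what shuffle=True / torch.randperm gives you):
# every index appears exactly once.
perm = random.sample(range(n), n)

# Uniform sampling *with* replacement (RandomSampler only does this
# if you pass replacement=True): indices may repeat or be missing.
with_replacement = [random.randrange(n) for _ in range(n)]

print(sorted(perm))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(sorted(with_replacement))  # duplicates and gaps are possible here
```

Both are “uniform” in the sense that no index is favored, but only the permutation guarantees full coverage of the dataset.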