Well, I just want to ask how PyTorch shuffles the dataset. This is probably a very silly question.
I mean, I set shuffle=True in the DataLoader, and I wonder how this affects the dataset. For example, I put the whole MNIST dataset, which has 60000 samples, into the DataLoader with shuffle=True. If I only use 30000 samples to train the model, is it possible that the model cannot identify the digits 6 to 9 because the shuffle happened to put all samples of those digits into the last 30000, which I did not use?
In other words, is it possible that the shuffle operation leads to a model trained on a partial training set missing entire classes of the whole dataset?
If you set shuffle=True, internally the RandomSampler will be used, which just permutes the indices of all samples as seen here.
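In case it helps to see the mechanics, here is a simplified stdlib-only sketch of what that sampler does. The real implementation lives in torch.utils.data and uses torch.randperm; the class name below is made up for illustration:

```python
import random

class RandomSamplerSketch:
    """Simplified stand-in for torch.utils.data.RandomSampler with
    replacement=False (the default used when shuffle=True)."""
    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        # The real sampler does torch.randperm(len(self.data_source));
        # random.shuffle produces the same kind of permutation.
        indices = list(range(len(self.data_source)))
        random.shuffle(indices)
        return iter(indices)

    def __len__(self):
        return len(self.data_source)

sampler = RandomSamplerSketch(list(range(10)))
indices = list(sampler)
print(sorted(indices))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] -- every index exactly once
```

The key point is that a permutation reorders the indices; it never drops or duplicates any of them.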
Are you slicing the data before passing it to the DataLoader?
If not, all samples should be used regardless of whether shuffle is active or not.
Thank you so much for answering this question.
Well, I loaded the whole dataset, which has 60000 samples, into the DataLoader with shuffle=True. But when I train the model, I only use about 6400 of those samples: I trained for 100 epochs and each epoch has 64 images.
So I am just wondering: do those 6400 samples contain all the digits? I mean, is it possible that I was very unlucky and only got digits 0 to 5, so the model could not identify the rest of the digits when it was tested? If this happened, the accuracy should be lower…
But the interesting thing is that my test accuracy is pretty normal. I trained with 6400 samples, tested on the test dataset of 10000 samples, and still got 70% ~ 80% accuracy.
How did you use this subset of 6400 samples?
Did you use a Subset?
Depending on the approach, you might have selected some particular classes and left out others.
However, if you’ve used a break statement inside the training loop for the DataLoader (which was supposed to use all data), it would be unlikely to get just a subset of classes.
Well, this is my code:
```python
def train_model(model, linear, input_CNN, learning_rate, nsamples, load_train_dataset):
    epoch = 0
    print_loss = 0
    criterion = nn.CrossEntropyLoss()
    optimizer_module = optim.SGD(model.parameters(), lr=learning_rate)
    optimizer_linear = optim.SGD(linear.parameters(), lr=learning_rate)
    optimizer_input_CNN = optim.SGD(input_CNN.parameters(), lr=learning_rate)
    for i, (data, target) in enumerate(load_train_dataset()):
        if i > nsamples:
            break
        # train network
        ...
```
The parameter nsamples is 100, which means training for 100 iterations (what I called epochs above). I don’t know if this code is right or not. I loaded the whole dataset into the data loader but use this nsamples to control the size of the training set.
This code might work and should create random batches, if shuffle=True is set in the DataLoader.
If you don’t want to use all samples, you could of course use a Subset and avoid using the break statement.
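To show why this works: Subset is conceptually just an index remapper around the original dataset. The class below is a simplified stand-in for illustration, not the actual torch.utils.data.Subset source:

```python
class SubsetSketch:
    """Simplified sketch of the idea behind torch.utils.data.Subset:
    it wraps a dataset and remaps indices, nothing more."""
    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = list(indices)

    def __getitem__(self, idx):
        # Index into the subset -> index into the original dataset
        return self.dataset[self.indices[idx]]

    def __len__(self):
        return len(self.indices)

data = [(i, i % 10) for i in range(100)]      # toy (sample, label) pairs
subset = SubsetSketch(data, range(0, 100, 2))  # keep every second sample
print(len(subset))  # 50
print(subset[1])    # (2, 2): subset index 1 maps to original index 2
```

With the real class, something like `Subset(dataset, torch.randperm(len(dataset))[:6400].tolist())` would give a fixed random 6400-sample subset that can then be passed to the DataLoader, with no break statement needed.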
Alright, I will try to use Subset rather than this.
But could you please help me with one more thing? Could the way I did this make the model learn incomplete features of the training dataset?
Could you explain a bit, what you mean by “incomplete features of the training data set”?
It’s unlikely that your model has only seen samples from classes 0 to 5, if you’ve shuffled the dataset.
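To put a rough number on “unlikely”: assuming for simplicity that each digit has exactly 6000 of the 60000 training samples (the real MNIST class counts vary slightly), the hypergeometric probability that a random 6400-sample draw misses one particular digit entirely can be computed in log space:

```python
from math import lgamma

def log_comb(n, k):
    # log of the binomial coefficient C(n, k), computed via lgamma
    # to avoid overflow for large n
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

# P(a random 6400-sample draw from 60000 contains none of the
# 6000 samples of one particular digit) = C(54000, 6400) / C(60000, 6400)
log_p = log_comb(54000, 6400) - log_comb(60000, 6400)
print(log_p)  # around -700, i.e. the probability is effectively zero
```

exp(-700) is far beyond anything that could happen in practice, so a shuffled subset of this size essentially always contains all ten digits.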
Even though you are only using approx. 10% of the dataset, the batches should be well shuffled.
Here is a small example:
```python
dataset = datasets.MNIST(root='./data',
                         transform=transforms.ToTensor())
loader = DataLoader(dataset, batch_size=64, shuffle=True)

targets = []
for idx, (data, target) in enumerate(loader):
    targets.append(target)
    if idx >= 99:  # stop after 100 batches (6400 samples)
        break

targets = torch.cat(targets)
print(targets.unique(return_counts=True))
```
> (tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), tensor([634, 666, 657, 666, 606, 579, 629, 684, 637, 642]))
As you can see, each class has approx. the same number of samples, which should avoid overfitting to a subset of the classes.
The overfitting might be a result of another code part. Are you manipulating the data in some other way?
Just to be sure: is the random sampling that shuffling does in the DataLoader uniform? Does it uniformly pick the indices of the data?
The DataLoader will use a RandomSampler as seen here, which uses torch.randperm in the default setup (replacement=False) and randomly permutes the sample indices. It is thus not uniform sampling with replacement, but a permutation of the indices: every sample is drawn exactly once per pass.
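The distinction can be made concrete with a stdlib-only sketch, where random.sample stands in for torch.randperm and random.randrange for sampling with replacement:

```python
import random

random.seed(0)
n = 10

# A permutation (what shuffle=True / torch.randperm gives you):
# every index appears exactly once.
perm = random.sample(range(n), n)

# Uniform sampling *with* replacement (RandomSampler only does this
# if you pass replacement=True): indices may repeat or be missing.
with_replacement = [random.randrange(n) for _ in range(n)]

print(sorted(perm))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(sorted(with_replacement))  # duplicates and gaps are possible here
```

Both are “uniform” in the sense that no index is favored, but only the permutation guarantees full coverage of the dataset.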