[DataLoader Problem] Problem arises when shuffle = True

I have training, validation, and test datasets (an NLP problem, so I used LSTM and GRU layers). The model contains a batch norm layer (I think this is the reason for the discrepancy I am observing). I don't have true labels for the test dataset. This was my training procedure before:

  • Train on the training dataset for 5 epochs (with model.train()). After each epoch, I make predictions on the validation dataset to check my results (with model.eval()) and save the model if the validation score (AUC) has increased. At the end, I load the best model and make predictions on the test data (with model.eval()).
    This is my pipeline:
for epoch_number in range(epochs):
       Train the model on the training dataset.
       Make predictions on the validation dataset.
       Save the model using torch.save(model.state_dict(), fname) if the validation score increased.

Load the best model and then make predictions on the test data.
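
A minimal sketch of this first pipeline, under loud assumptions: the toy data, the nn.Linear model, and the optimizer settings are hypothetical stand-ins, validation loss replaces the AUC score as the selection metric, and the best weights are kept in memory with copy.deepcopy instead of being written with torch.save. It only illustrates the train / evaluate / keep-best / restore order described above.

```python
import copy

import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy data: random features with binary labels.
x_train, y_train = torch.randn(64, 8), torch.randint(0, 2, (64, 1)).float()
x_val, y_val = torch.randn(32, 8), torch.randint(0, 2, (32, 1)).float()

model = nn.Linear(8, 1)  # stand-in for the LSTM/GRU model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

best_val_loss = float("inf")
best_state = None

for epoch in range(5):
    # Training step (one full-batch update per epoch in this toy setup).
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    optimizer.step()

    # Validation in eval mode, without tracking gradients.
    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()

    # Keep a copy of the weights whenever the validation metric improves.
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())

# Restore the best weights before predicting on held-out data.
model.load_state_dict(best_state)
model.eval()
with torch.no_grad():
    restored_val_loss = loss_fn(model(x_val), y_val).item()
```

The deepcopy matters: state_dict() returns references to the live parameters, so without it the stored "best" weights would keep changing as training continues.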

For the 5 epochs, my train and validation losses were:

1) train loss =  0.2599791 , val loss =  0.2254444
2) train loss =  0.2198705 , val loss =  0.2254712
3) train loss =  0.2080045 , val loss =  0.2124491
4) train loss =  0.1860864 , val loss =  0.18708
5) train loss =  0.1701995 , val loss =  0.1935813

Recently I changed the procedure a little bit. It is as follows :

  • Train on the training dataset for 5 epochs (with model.train()). After each epoch, I make predictions on both the validation and test datasets (with model.eval() for both) and save the validation and test predictions for future use. This is my pipeline:
for epoch_number in range(epochs):
       Train the model.
       Make predictions on validation and test.
       save the predictions.

This is where I am seeing a discrepancy. The train and validation losses should obviously be the same as before, because making predictions at every epoch does not update the weights. But these are my train and validation losses:

1) train loss =  0.2599791 , val loss =  0.2254444
2) train loss =  0.2196528 , val loss =  0.2283426
3) train loss =  0.2078255 , val loss =  0.1996013
4) train loss =  0.1848111 , val loss =  0.182577
5) train loss =  0.1680537 , val loss =  0.1896651

Note that the first-epoch train and validation losses are the same in both cases (this is the sole reason I have provided these numbers), and the values only differ after the 1st epoch.

Why is this discrepancy happening?

I have been thinking about this problem for some days, and here are my thoughts on it:

  • When you train with batch norm, it keeps running estimates and then uses them in model.eval() mode. But I think the running estimates are also changed in eval mode, and this is the reason for the discrepancy. Maybe I am wrong; it's just my view.

There are different factors that can produce this behavior. First of all, are you seeding all the frameworks that you use to generate random numbers? Are you setting cudnn to deterministic mode? You can do it with this piece of code:

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Moreover, you need to take into account that some GPU operations are non-deterministic. This link can be useful for you. (I am assuming you are using a GPU.)
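
For reference, a minimal sketch of what "seeding everything" could look like. The helper name and the seed value are hypothetical, and it covers only Python's random module and PyTorch; if NumPy is part of the pipeline, np.random.seed would need the same treatment.

```python
import random

import torch


def seed_everything(seed: int = 42) -> None:
    """Seed the RNGs used in this sketch and ask cudnn for deterministic kernels."""
    random.seed(seed)                 # Python's built-in RNG
    torch.manual_seed(seed)           # PyTorch global RNG
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# Re-seeding should reproduce the same random draws.
seed_everything(42)
first = torch.rand(3)
seed_everything(42)
second = torch.rand(3)
print(torch.equal(first, second))
```

Seeding only fixes the starting point of each generator. Two runs still diverge if one of them draws more random numbers than the other along the way.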

To answer your last point: as far as I know, the batch norm running statistics are not updated at evaluation time. You can find more information about batchnorm here.
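
This can be checked directly. The sketch below uses a standalone nn.BatchNorm1d layer with random inputs (both are hypothetical stand-ins, not the model from the question) and compares the running mean before and after a forward pass in eval mode.

```python
import torch
from torch import nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)

# A forward pass in training mode updates running_mean and running_var.
bn.train()
bn(torch.randn(16, 4))
mean_after_train = bn.running_mean.clone()

# A forward pass in eval mode uses the stored statistics and leaves them alone,
# even when the inputs have a very different mean.
bn.eval()
with torch.no_grad():
    bn(torch.randn(16, 4) + 5.0)
mean_after_eval = bn.running_mean.clone()

print(torch.equal(mean_after_train, mean_after_eval))
```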

def seed_everything(SEED=42):
    torch.backends.cudnn.deterministic = True
    # torch.backends.cudnn.benchmark = False


I am seeding everything as shown above. Each pipeline is reproducible on its own, but when I switch from one to the other, I observe this discrepancy. What could the problem be?

Do you mean that if you execute the first pipeline X times you obtain the same result? I suppose that you did not change the initialization order of your model and dataloaders, right? Without more information about your network and your code, it is difficult for me to know what the problem is. If you provide a piece of code that reproduces this behavior, I can take a look.

Moreover, I think it would be a good idea to set cudnn.benchmark to False if your input sizes change at each iteration. cudnn.benchmark chooses the best algorithm for your input data, which can be a source of randomness. It is quite possible that the cudnn LSTM kernels are not deterministic, as it is said here.


I am sorry for not giving enough details. Both the first and the second pipeline are reproducible (I have run each 3 to 4 times and the results are the same). I checked twice, and I am not changing the dataloaders for the training, validation, and test datasets.

Suppose there is some problem with cudnn: then how come the first-epoch losses are the same in both pipelines?

I also tried this: when I don't make predictions on the test data in the second pipeline, the losses and AUC score on the validation data are the same as in the first pipeline.

I see your point. One last question: are you shuffling the data in the validation and test sets? If the answer is yes, this can be the problem. As I said previously, without more information about your network and training code, it is not possible for me to guess what the problem is.


I forgot to mention this point. Shuffling is done only on train and not on validation and test.

The code that I have written is for a Kaggle competition, so I can't share it in full.

Some details (some of which I have already mentioned above):
Model: I am using LSTMs and GRUs in my model.
Seed: Everything is seeded as I have mentioned above.
Reproducible: Both pipelines are fully reproducible. The only discrepancy appears when moving from pipeline 1 to pipeline 2.

I have the impression that something is shifting the random state in your pipeline, but I cannot be sure. You can check this by deactivating the shuffle in the training dataset (in both pipelines). I do not have more ideas at this moment, I am sorry :frowning: Maybe someone else can give you a hint.


amosella, you are right: I think something is shifting the random state in my 2nd pipeline. When I set shuffle = False for the training dataset, both pipelines produced the same result.

Other things I tried are:

  • Removed batch norm from the model: with shuffle = True the results still differed (from the 2nd epoch onwards), which implies that batch norm is not the problem (I have changed the thread title accordingly).

As the 1st-epoch results are the same for both pipelines, I think adding test predictions is changing the random state.

  • Removed the test predictions from the second pipeline and added the validation predictions one more time, as shown below:
for epoch_number in range(epochs):
        Train the model
        Make predictions on validation 
        Make predictions on validation             # 2nd time
        save the model if validation score increases.

To my surprise, the results are the same as in my second pipeline.
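
These observations fit the pattern of something drawing from the global random number generator between epochs. The sketch below reproduces that pattern in isolation, under assumptions: the toy TensorDataset is hypothetical, and torch.rand(1) is only a stand-in for whatever consumes the global RNG in the real pipeline. Whether iterating a non-shuffled validation or test loader does so depends on the PyTorch version and loader settings, so this shows the mechanism, not a confirmed diagnosis.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(100))  # hypothetical toy data: ids 0..99


def epoch_orders(extra_draw: bool):
    """Return the sample order of two epochs, optionally touching the global RNG in between."""
    torch.manual_seed(42)
    loader = DataLoader(dataset, batch_size=10, shuffle=True, num_workers=0)
    orders = []
    for _ in range(2):
        orders.append(torch.cat([batch[0] for batch in loader]).tolist())
        if extra_draw:
            torch.rand(1)  # stand-in for any code that consumes the global RNG
    return orders


plain = epoch_orders(extra_draw=False)
shifted = epoch_orders(extra_draw=True)
print("epoch 1 order identical:", plain[0] == shifted[0])
print("epoch 2 order identical:", plain[1] == shifted[1])
```

With the default settings the shuffle order of each epoch is derived from the global RNG at the moment the training loader is iterated. The first epoch matches because nothing extra has been drawn yet, and later epochs see a different RNG state.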

Hmm, are you doing data augmentation on the validation and test datasets? Somewhere you are calling a random method that produces this shift in the random state.

No augmentation on any of the datasets.

I will go through it once more.

amosella, this is my code structure, and I couldn't figure out where the problem is :frowning_face:

def fit_data():
    for epoch_num in range(epochs):
        for train_data in train_iterator:
            calculate loss
        for val_data in val_iterator:
            make validation predictions
        for test_data in test_iterator:
            make test predictions
    save the validation and test predictions.

The iterators for all the datasets are the outputs of the corresponding dataloaders. I don't know where the shift in random state is being introduced.
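
One way to make the training order independent of whatever else touches the global random state is to give the training loader its own torch.Generator. This is a sketch under assumptions: it relies on the generator argument of DataLoader, which older PyTorch releases do not have, and the toy dataset and the seed 123 are hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(100))  # hypothetical toy data: ids 0..99


def isolated_epoch_orders(extra_draw: bool):
    """Two epochs of shuffled ids, with the shuffle driven by a private generator."""
    torch.manual_seed(42)
    shuffle_rng = torch.Generator()
    shuffle_rng.manual_seed(123)  # hypothetical seed, used only by the training loader
    loader = DataLoader(dataset, batch_size=10, shuffle=True, generator=shuffle_rng)
    orders = []
    for _ in range(2):
        orders.append(torch.cat([batch[0] for batch in loader]).tolist())
        if extra_draw:
            torch.rand(1)  # consumes the global RNG, not shuffle_rng
    return orders


plain = isolated_epoch_orders(extra_draw=False)
shifted = isolated_epoch_orders(extra_draw=True)
print("both epochs identical:", plain == shifted)
```

The data is still reshuffled every epoch, but the order no longer depends on how many random numbers the rest of the loop has drawn.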

Can you set shuffle=False in the data loader? The documentation says that shuffle defaults to False. Can you verify it? Also, set the number of workers to 0 and check again.

I put workers = 0 and results are different.

I do not have more ideas :cry:. I think that you can use this function to see what is changing the state (using prints…).
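
A minimal sketch of that kind of check, assuming torch.get_rng_state() is the relevant function. The helper rng_fingerprint is hypothetical; it hashes the Python and PyTorch CPU RNG states so they can be printed before and after each stage of the loop. CUDA and NumPy states are not covered here.

```python
import hashlib
import random

import torch


def rng_fingerprint() -> str:
    """Short hash of the Python and PyTorch CPU RNG states."""
    digest = hashlib.sha256()
    digest.update(repr(random.getstate()).encode())
    digest.update(bytes(torch.get_rng_state().tolist()))
    return digest.hexdigest()[:12]


torch.manual_seed(0)
before = rng_fingerprint()

_ = torch.arange(10).sum()  # deterministic op, does not touch the RNG
unchanged = rng_fingerprint()

_ = torch.rand(1)           # draws from the global RNG
changed = rng_fingerprint()

print("after arange:", before == unchanged)
print("after rand:  ", before == changed)
```

Printing the fingerprint after the training loop, after the validation loop, and after the test loop should show which stage moves the state.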


Thanks for the help amosella. :smile:

So I could reproduce my problem. The link to code is given below :

As you can see from the output, train_id is different in the two cases, which must not happen. My original code has the same format.

Maybe, I am a bit late to the party. Why do you think train_ids should not be different?


I am iterating over train only once per epoch in both scripts, so I expect the ids to be the same in both cases. Or am I missing something?

I think there is a misunderstanding. I see that the train ids in both scripts are the same.

First script:

15 examples of train
tensor([76., 59., 48., 42., 32., 49., 70.,  3., 98., 63., 90., 25., 91., 92.,

Second script:

15 examples of train
tensor([76., 59., 48., 42., 32., 49., 70.,  3., 98., 63., 90., 25., 91., 92.,

Or did I misunderstand your question?
