[DataLoader Problem] Problem arises when shuffle = True

I am sorry for not giving enough details. The first and second pipelines are reproducible (I have run each 3 to 4 times and the results are the same). I checked twice, and I am not changing the dataloaders for the training, validation, and test datasets.

Suppose there were some problem with cudnn; then how come the first-epoch losses are the same in both pipelines?

I also tried this: when I don’t make predictions on the test data in the second pipeline, the losses and AUC score on the validation data are the same as in the first pipeline.

I see your point. One last question: are you shuffling the data in the validation and test sets? If the answer is yes, this could be the problem. As I said previously, without more information about your network and training code it is not possible for me to guess what the problem is.


I forgot to mention this point: shuffling is done only on the training set, not on validation and test.

The code I have written is for a Kaggle competition, so I can’t share it in full.

Some details (some of which I have mentioned above):

Model: I am using LSTMs and GRUs in my model.
Seed: everything is seeded, as I have mentioned above.
Reproducible: both pipelines are fully reproducible. The only discrepancy is the shift from pipeline 1 to pipeline 2.
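As a sketch of what “everything is seeded” means here, using Python’s stdlib `random` as a stand-in (the calls in the comment are the usual PyTorch/NumPy equivalents, not part of this thread’s actual code):

```python
import random

# Minimal sketch: fixing the seed before a run makes the run repeatable.
# In a PyTorch pipeline the same idea usually covers more generators, e.g.:
#   random.seed(0); np.random.seed(0); torch.manual_seed(0)
#   torch.backends.cudnn.deterministic = True
def run_pipeline(seed):
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(5)]

print(run_pipeline(42) == run_pipeline(42))  # True: same seed, identical draws
```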

I have the impression that something is shifting the random state in your pipeline, but I cannot be sure. You can check this by deactivating shuffling on the training dataset (in both pipelines). I do not have more ideas at this moment, I am sorry :frowning: Maybe someone else can give you a hint.


amosella, you are right, I think there is something shifting the random state in my 2nd pipeline. When I set shuffle = False for the training dataset, both pipelines produced the same result.

Other things I tried are:

  • Removed batch norm from the model, and shuffle = True still gave different results (from the 2nd epoch onwards), which implies there is no problem with batch norm (I have changed the title of the thread).

As the 1st-epoch results are the same for both pipelines, I think adding the test predictions is changing the random state.

  • Removed the test predictions from the second pipeline and added the validation predictions one more time, as shown below:
for epoch_number in range(epochs):
    train the model
    make predictions on validation
    make predictions on validation    # 2nd time
    save the model if the validation score increases

and to my surprise the results are the same as in my second pipeline.
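That observation is consistent with any extra pass acting as an extra consumer of the shared RNG state. A minimal stand-in sketch (stdlib `random` instead of torch’s global generator; `epochs(True)` mimics the pipeline with the extra prediction pass):

```python
import random

# Two "pipelines" seeded identically; pipeline B draws one extra value
# from the shared RNG per epoch, like an extra eval pass that happens
# to touch the global random state.
def epochs(extra_draw, n_epochs=3):
    random.seed(0)
    order = []
    for _ in range(n_epochs):
        ids = list(range(10))
        random.shuffle(ids)      # "shuffle=True" on the train set
        order.append(ids)
        if extra_draw:
            random.random()      # extra consumer of the RNG state
    return order

a, b = epochs(False), epochs(True)
assert a[0] == b[0]  # 1st epoch matches: the state diverges only afterwards
assert a != b        # later epochs see shuffles drawn from a shifted state
```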

Hmm, are you doing data augmentation on the validation and test datasets? Somewhere you are calling a random method that produces this shift in the random state.

No augmentation on any of the datasets.

I will go through it once more.

amosella, this is my code structure, and I couldn’t figure out where the problem is :frowning_face:

def fit_data():
    for epoch_num in range(epochs):
        for train_data in train_iterator:
            predict
            calculate loss
            backprop

        for val_data in val_iterator:
            predict

        for test_data in test_iterator:
            predict

    save the validation and test predictions

The iterators for all the datasets come from the corresponding dataloaders. I don’t know where the shift in random state is being induced.

Can you put shuffle=False in the data loaders? The documentation says shuffle defaults to False; can you verify it? Also, set num_workers to 0 and check.

I set num_workers = 0 and the results are still different.

I do not have more ideas :cry:. I think you can use this function to find out who is changing the state (using prints…).
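The debugging idea can be sketched like this: snapshot the RNG state around each stage and print which stage moved it. Shown with Python’s stdlib `random` as a stand-in; with PyTorch you would use torch.get_rng_state() / torch.set_rng_state() the same way (the three stage functions are hypothetical placeholders):

```python
import random

def stage_train(): random.random()   # pretend training consumes RNG
def stage_val():   pass              # validation should not touch it
def stage_test():  random.random()   # a stage that touches it unexpectedly

random.seed(0)
for name, stage in [("train", stage_train), ("val", stage_val), ("test", stage_test)]:
    before = random.getstate()
    stage()
    changed = random.getstate() != before
    print(f"{name}: RNG state changed = {changed}")
# prints: train True, val False, test True
```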


Thanks for the help amosella. :smile:

So I could reproduce my problem. The link to the code is given below:

As you can see from the output, train_id is different in the two cases, which must not happen. My original code has the same structure.

Maybe I am a bit late to the party. Why do you think the train_ids should not be different?


I am iterating over the training set only once per epoch in both scripts, so I expect the ids in both cases to be the same, or I am missing something.

I think there is a misunderstanding. I see that the train ids in both scripts are the same.

First script:

15 examples of train
tensor([76., 59., 48., 42., 32., 49., 70.,  3., 98., 63., 90., 25., 91., 92.,
        51.])

Second script:

15 examples of train
tensor([76., 59., 48., 42., 32., 49., 70.,  3., 98., 63., 90., 25., 91., 92.,
        51.])

Or did I misunderstand your question?


From the second epoch onwards they are different.

I have raised the issue here: https://github.com/pytorch/pytorch/issues/20717

I had a look at the GitHub issue and the relevant files.
It is a tricky issue, and it is caused by the line that updates the RNG state (unnecessarily?): https://github.com/pytorch/pytorch/blob/master/torch/utils/data/dataloader.py#L437

I see there are 2 workarounds.

  1. The DataLoader code can be fixed by moving the base-seed calculation inside the if block. (https://github.com/pytorch/pytorch/pull/20749)

  2. Wrap your training code in get_rng_state() / set_rng_state() calls, as below:

prev_rng_state = torch.get_rng_state()  # get previous rng state

for ep_num in range(3):
    print("==================================",ep_num+1,"========================")

    torch.set_rng_state(prev_rng_state) # set rng state
    for batch,(X_train,y_train,weights) in enumerate(train_iterator):
        if batch==0:
            print("15 examples of train")
            print(X_train[0:15, 0])
            
    prev_rng_state = torch.get_rng_state() # save rng state
    
    for batch,(X_val,y_val) in enumerate(val_iterator):
        if batch==0:
            print("15 examples of validation")
            print(X_val[0:15,0])
    
    for batch,(X_test,y_test) in enumerate(test_iterator):
        if batch==0:
            print("15 examples of test")
            print(X_test[0:15,0])
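For contrast, the first workaround (the idea behind pytorch/pytorch#20749) can be sketched with stdlib `random` as a stand-in: the base seed is drawn from the global RNG only when worker processes actually need it, so creating a zero-worker iterator no longer advances the shared state as a side effect (both `new_iter_*` functions are hypothetical simplifications, not the real DataLoader code):

```python
import random

def new_iter_buggy(num_workers):
    base_seed = random.random()   # always drawn: advances the global RNG
    return iter(range(3))

def new_iter_fixed(num_workers):
    if num_workers > 0:           # drawn only when workers must be seeded
        base_seed = random.random()
    return iter(range(3))

random.seed(0)
new_iter_fixed(0)
state_fixed = random.getstate()

random.seed(0)
state_untouched = random.getstate()

random.seed(0)
new_iter_buggy(0)
state_buggy = random.getstate()

print(state_fixed == state_untouched)  # True: fixed version leaves the state alone
print(state_buggy == state_untouched)  # False: buggy version shifts it
```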

Thanks a lot, Arul. Now I am getting the same outputs. A small change to your code if you want both pipelines to produce the same output:

prev_rng_state = torch.get_rng_state()  # get previous rng state

for ep_num in range(10):
    print("==================================",ep_num+1,"========================")

    torch.set_rng_state(prev_rng_state) # set rng state
    for batch,(X_train,y_train,weights) in enumerate(train_iterator):
        if batch==0:
            print("15 examples of train")
            print(X_train[0:15, 0])
             
    for batch,(X_val,y_val) in enumerate(val_iterator):
        if batch==0:
            print("15 examples of validation")
            print(X_val[0:15,0])
    
    prev_rng_state = torch.get_rng_state() # save rng state
    
    for batch,(X_test,y_test) in enumerate(test_iterator):
        if batch==0:
            print("15 examples of test")
            print(X_test[0:15,0])