In my training script, I have a function ‘train’ that carries out model training for a given number of epochs, and training proceeds successfully: the loss gradually decreases and I obtain decent validation-set accuracy.
Now I want to train the same model 3 times, so I created a ‘for loop’ that executes my ‘train’ function 3 times. The first 2 runs produce decent val-set accuracies, but in the third run the model completely fails to train: the training loss stays almost constant at a very high value. I do not understand why this is happening; I suspect it might be related to PyTorch’s random seed or to CUDA memory allocation.
Note: in my ‘train’ function I re-initialize my model, optimizer, and scheduler. The only thing that stays the same across runs is my ‘train_loader’ instance of the DataLoader class. I am not using a manual seed, so I do expect some variation in my val results; however, the model is not training at all in the third run.
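For clarity, here is a minimal sketch of the setup I described. The model, optimizer, scheduler, dataset, and hyperparameters here are placeholders (my real ones differ); the point is the structure: the DataLoader is built once outside, while everything else is re-created inside ‘train’ on each run.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for my real dataset; the DataLoader is created ONCE
# and shared across all 3 runs, as in my actual script.
dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
train_loader = DataLoader(dataset, batch_size=16, shuffle=True)

def train(loader, epochs=2):
    # Model, optimizer, and scheduler are re-initialized on every call.
    model = nn.Linear(10, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return loss.item()

# The loop in question: 3 independent training runs.
final_losses = [train(train_loader) for _ in range(3)]
```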
Thanks a lot !