In my training script, I have a function ‘train’ that carries out model training for a given number of epochs, and training proceeds successfully: the loss gradually decreases and I obtain decent validation-set accuracy.
Now I want to train the same model 3 times, so I created a ‘for loop’ that executes my ‘train’ function 3 times. The first 2 runs produce decent val-set accuracies, but in the third run the model completely fails to train: the training loss stays almost constant at a very high value. I do not understand why this is happening; I suspect it might be related to PyTorch’s random seed or to CUDA memory allocation.
Note: in my ‘train’ function I re-initialize my model, optimizer, and scheduler. The only thing that stays the same across runs is my ‘train_loader’ instance of the DataLoader class. I am not using a manual seed, so I do expect some variation in my val results; however, the model is not training at all in the third run.
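For clarity, here is a minimal sketch of the setup I described. The model, optimizer, scheduler, dataset, and hyperparameters here are placeholders (my real ones differ); the point is the structure: the DataLoader is built once outside, while everything else is re-created inside ‘train’ on each run.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for my real dataset; the DataLoader is created ONCE
# and shared across all 3 runs, as in my actual script.
dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
train_loader = DataLoader(dataset, batch_size=16, shuffle=True)

def train(loader, epochs=2):
    # Model, optimizer, and scheduler are re-initialized on every call.
    model = nn.Linear(10, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return loss.item()

# The loop in question: 3 independent training runs.
final_losses = [train(train_loader) for _ in range(3)]
```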
Thanks a lot !