I am working on a Python module that lets me train multiple built-in torchvision models in sequence (one after another), keeping track of each model's training results. Sometimes I just need a rough idea of which network performs better on my problem, so that I can play with it more carefully later.
When it comes to training, I have a list of models (ResNets, VGGs, AlexNet, Inception, DenseNet and SqueezeNet), a basic FOR loop to train one network after another, and a dictionary where I track the training results of each model, which I later visualize. I made sure everything works: the first training sessions went pretty well.
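For context, the loop is structured roughly like this. This is a simplified, self-contained sketch with dummy data, not the actual code from the repo; the names (model_constructors, results) are illustrative:

import torch
import torch.nn as nn
import torchvision.models as models
from torch.utils.data import DataLoader, TensorDataset

# Dummy 2-class data so the sketch runs end to end;
# the real code uses a proper image dataset
images = torch.randn(16, 3, 224, 224)
labels = torch.randint(0, 2, (16,))
loader = DataLoader(TensorDataset(images, labels), batch_size=4)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_constructors = {
    "resnet18": models.resnet18,
    "vgg16": models.vgg16,
    # ... vgg19, alexnet, inception, densenet, squeezenet
}

results = {}  # model name -> training accuracy

for name, constructor in model_constructors.items():
    model = constructor(num_classes=2).to(device)  # fresh model on the GPU
    criterion = nn.CrossEntropyLoss()              # new loss for every model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

    model.train()
    correct = 0
    for batch, targets in loader:
        batch, targets = batch.to(device), targets.to(device)  # data moved to GPU too
        optimizer.zero_grad()
        outputs = model(batch)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        correct += (outputs.argmax(dim=1) == targets).sum().item()

    results[name] = correct / len(labels)  # stored as a plain float, not a tensor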
But there is a problem: models now stop training very quickly (accuracy around 45-55%, i.e. random guesses for 2 classes), although the first 2 models in the list seem to train fine. On top of that, around the 4th network in my list (VGG19) I get a CUDA out-of-memory error, even though everything works fine when I train VGG19 alone. Clearly, the models share some resource that fills up at some point, after which training becomes impossible.
The question: could anyone please help me identify those memory leaks and fix the issue? My guesses:
- All models get moved to the GPU. Should I remove each model from there at the end of its training? (see the sketch after this list)
- The optimizer, loss function and scheduler get initialized afresh for each new model, so I think there should be no problem there. Correct?
- During training, batches and labels get moved to the GPU as well. Should I remove them at the end so that GPU memory is available for the next training session, or does it get cleaned up automatically?
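To make guesses 1 and 3 concrete, this is the kind of cleanup I imagine appending to the bottom of the loop body in the sketch above (just my guess, not something that is in the repo):

    # ...at the end of the loop body for each model:
    model = model.cpu()              # guess 1: move the finished model off the GPU
    del model, optimizer, criterion  # drop the per-model objects

    # guess 3: batch/targets get rebound on every inner-loop iteration,
    # so I assume the previous batch's GPU memory becomes reclaimable on its own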
In case anyone wishes to have a look at the code - https://github.com/EvgeniiTitov/Neural-Net-Trainer
Would love to hear constructive criticism from you people. The trainer class, GroupTrainer, can be found in wrappers/ at line 291; it gets called from main.py.
Thank you very much guys!
A quick update on the issue.
When I try to train the VGGs, AlexNet and SqueezeNet 1.0 from scratch, for some reason they do not learn at all: accuracy keeps bouncing around 50% (0.457, 0.542). I have tried everything: different optimizers (SGD, Adam), learning rates, numbers of epochs. On the other hand, under exactly the same conditions, the ResNets (18, 34, 50), Inception and DenseNet seem to train okay.
I have literally no idea what I am doing wrong. I have been following this tutorial: https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html What's interesting is that if you scroll down, the author attempts to train a model from scratch and ends up with the same problem: his model doesn't learn at all, the accuracy doesn't change (0.457). Sadly, he doesn't comment on it.
If anyone could explain why this is the case, please do; I have literally been following the tutorial. On top of that, when I train the VGGs [VGG16, VGG19] in sequence, the code crashes on VGG19 with a CUDA out-of-memory error. BUT when I train VGG19 alone, it runs, yet it still doesn't learn anything anyway. At the end of each training session for each model (in the basic FOR loop) I delete everything that could affect the next session: the model, the trained model, the optimizer, the loss function, etc.
To check why your model is not learning, you need to look at the requires_grad attribute of the model weights and make sure the optimizer is set up correctly. To check requires_grad:
for param in my_model.parameters():
    print(param.requires_grad)  # should be True for the layers you want to train
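And to make sure the optimizer is correct, you can build it over the trainable parameters only. A generic sketch (resnet18 here is just a stand-in for whatever model you built):

import torch
import torchvision.models as models

my_model = models.resnet18(num_classes=2)  # stand-in for your model

# only parameters with requires_grad=True will actually be updated
trainable = [p for p in my_model.parameters() if p.requires_grad]
print(len(trainable), "trainable parameter tensors")

# pass only the trainable parameters to the optimizer, so frozen
# layers cannot silently stop the model from learning
optimizer = torch.optim.SGD(trainable, lr=0.001, momentum=0.9)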
Training crash? You need to remove your previous models from GPU memory, which can be a bit tricky. If you are running a regular Python script, then something like the following may work:

import gc

del model     # drop your reference to the previous model first
gc.collect()  # maybe run gc.collect() to force a collection pass
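Note that PyTorch's caching allocator also holds on to freed GPU memory, so on top of gc.collect() you may need to release the cache explicitly. A small sketch (standard PyTorch calls, nothing specific to this repo):

import torch

torch.cuda.empty_cache()  # return cached, currently unused blocks to the driver
print(torch.cuda.memory_allocated() / 1e6, "MB still allocated")  # sanity check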
Hello, I know this is an older post, but I am curious whether there has been any update? It does not appear to have been resolved, and I believe I am having the same issue.
Thanks - Grey