I am working on a Python module that lets me train multiple built-in torchvision models in sequence (one after another), keeping track of each model's training results. Sometimes I just need a rough idea of which network performs better on my problem, so that I can play with it more carefully later.
When it comes to training, I have a list of models (ResNets, VGGs, AlexNet, Inception, DenseNet and SqueezeNet), a basic FOR loop to train one network after another, and a dictionary where I track the training results of each model, which I later visualize. I made sure everything works: the first training sessions went pretty well.
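For context, the loop is structured roughly like this. This is a simplified, self-contained sketch with dummy data, not the actual code from the repo; the names (model_constructors, results) are illustrative:

import torch
import torch.nn as nn
import torchvision.models as models
from torch.utils.data import DataLoader, TensorDataset

# Dummy 2-class data so the sketch runs end to end;
# the real code uses a proper image dataset
images = torch.randn(16, 3, 224, 224)
labels = torch.randint(0, 2, (16,))
loader = DataLoader(TensorDataset(images, labels), batch_size=4)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_constructors = {
    "resnet18": models.resnet18,
    "vgg16": models.vgg16,
    # ... vgg19, alexnet, inception, densenet, squeezenet
}

results = {}  # model name -> training accuracy

for name, constructor in model_constructors.items():
    model = constructor(num_classes=2).to(device)  # fresh model on the GPU
    criterion = nn.CrossEntropyLoss()              # new loss for every model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

    model.train()
    correct = 0
    for batch, targets in loader:
        batch, targets = batch.to(device), targets.to(device)  # data moved to GPU too
        optimizer.zero_grad()
        outputs = model(batch)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        correct += (outputs.argmax(dim=1) == targets).sum().item()

    results[name] = correct / len(labels)  # stored as a plain float, not a tensor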
But there is a problem: models now stop training very quickly (accuracy around 45-55%, i.e. random guesses for 2 classes), although the first 2 models in the list seem to train fine. On top of that, around the 4th network in my list (VGG19) I get a CUDA out-of-memory error, even though everything works fine when I train VGG19 alone. Clearly, the models share some resource that fills up at some point, after which training becomes impossible.
The question: could anyone please help me identify those memory leaks and fix the issue? My guesses:
- All models get moved to the GPU. Should I remove each model from there at the end of its training? (see the sketch after this list)
- The optimizer, loss function and scheduler get initialized afresh for each new model, so I think there should be no problem there. Correct?
- During training, batches and labels get moved to the GPU as well. Should I remove them at the end so that GPU memory is available for the next training session, or does it get cleaned up automatically?
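To make guesses 1 and 3 concrete, this is the kind of cleanup I imagine appending to the bottom of the loop body in the sketch above (just my guess, not something that is in the repo):

    # ...at the end of the loop body for each model:
    model = model.cpu()              # guess 1: move the finished model off the GPU
    del model, optimizer, criterion  # drop the per-model objects

    # guess 3: batch/targets get rebound on every inner-loop iteration,
    # so I assume the previous batch's GPU memory becomes reclaimable on its own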
In case anyone wishes to have a look at the code - https://github.com/EvgeniiTitov/Neural-Net-Trainer
Would love to hear constructive criticism from you people. The trainer class, GroupTrainer, can be found in wrappers/ at line 291; it gets called from main.py.
Thank you very much guys!
A quick update on the issue.
When I try to train the VGGs, AlexNet and SqueezeNet 1.0 from scratch, for some reason they do not learn at all: accuracy keeps bouncing around 50% (0.457, 0.542). I have tried everything: different optimizers (SGD, Adam), learning rates, numbers of epochs. On the other hand, under exactly the same conditions, the ResNets (18, 34, 50), Inception and DenseNet seem to train okay.
I have literally no idea what I am doing wrong. I have been following this tutorial: https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html What's interesting is that if you scroll down, the author attempts to train a model from scratch and ends up with the same problem: his model doesn't learn at all, the accuracy doesn't change (0.457). Sadly, he doesn't comment on it.
If anyone could explain why this is the case, please do; I have literally been following the tutorial. On top of that, when I train the VGGs [VGG16, VGG19] in sequence, the code crashes on VGG19 with a CUDA out-of-memory error. BUT when I train VGG19 alone, it runs, yet it still doesn't learn anything anyway. At the end of each training session for each model (in the basic FOR loop) I delete everything that could affect the next session: the model, the trained model, the optimizer, the loss function, etc.
To check why your model is not learning, you need to look at the requires_grad attribute of the model weights and make sure the optimizer is set up correctly. To check requires_grad:
for param in my_model.parameters():
    print(param.requires_grad)  # should be True for the layers you want to train
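And to make sure the optimizer is correct, you can build it over the trainable parameters only. A generic sketch (resnet18 here is just a stand-in for whatever model you built):

import torch
import torchvision.models as models

my_model = models.resnet18(num_classes=2)  # stand-in for your model

# only parameters with requires_grad=True will actually be updated
trainable = [p for p in my_model.parameters() if p.requires_grad]
print(len(trainable), "trainable parameter tensors")

# pass only the trainable parameters to the optimizer, so frozen
# layers cannot silently stop the model from learning
optimizer = torch.optim.SGD(trainable, lr=0.001, momentum=0.9)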
Training crash? You need to remove your previous models from GPU memory, which can be a bit tricky. If you are running a regular Python script, then something like the following may work:

import gc

del model     # drop your reference to the previous model first
gc.collect()  # maybe run gc.collect() to force a collection pass
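Note that PyTorch's caching allocator also holds on to freed GPU memory, so on top of gc.collect() you may need to release the cache explicitly. A small sketch (standard PyTorch calls, nothing specific to this repo):

import torch

torch.cuda.empty_cache()  # return cached, currently unused blocks to the driver
print(torch.cuda.memory_allocated() / 1e6, "MB still allocated")  # sanity check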
Hello, I know this is an older post, but I am curious whether there has been any update? It does not appear to have been resolved, and I believe I am having the same issue.
Thanks - Grey