How to avoid memory leak when training multiple models sequentially

StefanKMdA · October 22, 2020, 8:02pm

Hi, first of all thank you for reading this question. I have a script (say main.py) where I train multiple models with PyTorch Lightning, each with different hyperparameters. The loop in the script is similar to the following code snippet:

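from pytorch_lightning import Trainer

# Model, Datamodule, HPARAMS and gpus are defined elsewhere in main.py
for hparam in HPARAMS:
    trainer = Trainer(gpus=gpus)
    datamodule = Datamodule()
    model = Model(hparam)
    trainer.fit(model, datamodule)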

After training the second model, I get a GPU memory leak. I have seen this question a couple of times in forums (sorry in advance), but none of the suggested solutions worked for me.

I tried deleting all relevant variables that could store tensors with

import gc
import torch

del model, datamodule, trainer, logger
gc.collect()
torch.cuda.empty_cache()

but this did not fix the memory leak. I did this after each model finished training.

I read a couple of other suggestions, like using Ray or just starting a new subprocess for every training run, but I thought there must be another way.

Any help is much appreciated!

ptrblck · October 24, 2020, 6:55am

Do you see the same memory leak without using Lightning, i.e. with a plain PyTorch training routine?
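
One way to check this is a stripped-down loop along the following lines. This is only a sketch: the tiny nn.Linear model, the random batches and the helper names train_once and allocated_after_each_run are placeholders, not anything taken from your script. If the reported numbers keep growing from run to run, the leak is reproducible without Lightning.

```python
import gc

import torch
from torch import nn


def train_once(lr, device, steps=5):
    """Train a throwaway model for a few steps and return the last loss value."""
    model = nn.Linear(16, 4).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        x = torch.randn(8, 16, device=device)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # model and optimizer go out of scope when this function returns
    return loss.item()


def allocated_after_each_run(learning_rates):
    """Run the trainings one after another and record allocated GPU bytes after each."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    readings = []
    for lr in learning_rates:
        train_once(lr, device)
        gc.collect()
        if device.type == "cuda":
            torch.cuda.empty_cache()
            readings.append(torch.cuda.memory_allocated())
        else:
            # No GPU present: nothing to measure, record zero
            readings.append(0)
    return readings
```

Calling allocated_after_each_run with a list of learning rates and printing the result gives one reading per run to compare.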

StefanKMdA · October 25, 2020, 9:54am

Thanks for looking into my question. I ended up just starting a new subprocess for each new model. Maybe another time I will look into whether pytorch-lightning has anything to do with it.
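
For anyone wondering what the subprocess approach can look like, here is a minimal sketch using only the standard library. It assumes the training of a single model has been moved into its own script; the script name train_one.py and the --hparam flag are hypothetical and would need to match your own code.

```python
import subprocess
import sys


def build_command(hparam, script="train_one.py"):
    """Build the command line for one training run (script and flag are hypothetical)."""
    return [sys.executable, script, "--hparam", str(hparam)]


def run_job(cmd):
    """Run one job in a fresh Python interpreter and return its exit code."""
    return subprocess.run(cmd).returncode
```

The loop in main.py then calls run_job(build_command(hparam)) for each entry in HPARAMS. Each run lives in its own process, so all of its GPU memory is released when that process exits.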

Molaire · January 2, 2021, 1:04am

I’ve had the same problem and I isolated it to PyTorch Lightning’s datamodule.

shawnvosburg · April 2, 2021, 1:00am

I’ve had a similar problem while doing active learning.

Here is what I found. Deleting the model, datamodule, trainer and logger does nothing to solve the GPU memory leak. After much investigative work, I found out that the optimizer (I was using torch.optim.Adam) stores references to the model’s weights and also has state tensors of its own that reside on the GPU. I wrongly believed that deleting both the model and the trainer would delete the optimizer object.

In my case, the problem was solved in two steps. First, move all of the model’s weights off the GPU:

# Assumes that your pytorch-lightning Model object
# has the pytorch model object as self.model
model.model.cpu()

Then we need to move the optimizer’s internal state tensors off the GPU as well. I did it the following way, but I would prefer an officially supported way (does one already exist?).

# I only have Adam in my project but this may be different in your case.
for optimizer_metrics in trainer.optimizers[0].state.values():
    for metric_name, metric in optimizer_metrics.items():
        if torch.is_tensor(metric):
            optimizer_metrics[metric_name] = metric.cpu()
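
The two steps above can be wrapped into one helper that handles any number of optimizers. This is a sketch, not an official Lightning or PyTorch API: the function names optimizer_state_to_cpu and release_gpu_memory are made up for illustration, and I have only used this pattern with Adam.

```python
import gc

import torch


def optimizer_state_to_cpu(optimizer):
    """Move every tensor in the optimizer's per-parameter state to the CPU."""
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.cpu()


def release_gpu_memory(module, optimizers):
    """Move weights and optimizer state off the GPU, then clear the CUDA cache."""
    # Step 1: module.cpu() moves the parameters in place
    module.cpu()
    # Step 2: the optimizer's own state tensors (e.g. Adam's moment estimates)
    for optimizer in optimizers:
        optimizer_state_to_cpu(optimizer)
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```

With the setup from this post, the call would be release_gpu_memory(model.model, trainer.optimizers) at the end of each loop iteration.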

After these two steps, I am still left with one tensor residing on the GPU, but I can’t find what keeps it alive. Every time I go through the loop, one extra tensor stays on the GPU. That extra tensor per loop isn’t enough to exhaust my CUDA memory, but I would like to hear back if someone figures out where it is referenced.
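
For hunting down leftover tensors like this one, a small diagnostic that walks the garbage collector's tracked objects can help. The sketch below is an assumption-laden aid rather than a complete tool: the name live_tensors is made up, and it only sees tensors that have a Python object tracked by gc, so tensors held purely on the C++ side may not show up.

```python
import gc

import torch


def live_tensors(device_type="cuda"):
    """Return (shape, dtype) for each gc-tracked tensor on the given device type."""
    found = []
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.device.type == device_type:
                found.append((tuple(obj.shape), obj.dtype))
        except Exception:
            # Some tracked objects raise on attribute access; skip them
            continue
    return found
```

Comparing the output of live_tensors() before and after one loop iteration shows the shape of the extra tensor, and gc.get_referrers on the tensor object can then hint at what is still holding it.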

MohammedAljahdali · July 11, 2021, 6:18am

Are there any updates on this issue? I am facing the same problem: I train a model multiple times with different trainers, and every time a new trainer is initialised the GPU memory usage increases.