RAM exhausted while training on GPU

Fanny · July 22, 2021, 10:32am

Hi everyone,
I’m trying to train an instance segmentation model on GPU, based on this PyTorch tutorial: https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html

However, after a few epochs (3 with a batch_size of 1, 6 with a batch_size of 2), all of the available RAM is used up and training stops.

How can this be fixed?

My GPU has 16 GB of memory, my PC has 32 GB of RAM, and I am trying to train the model on ~5000 images.

The rise in RAM usage seems to happen during losses.backward() in the train_one_epoch function (available here).

Thanks.

JuanFMontesinos · July 22, 2021, 12:32pm

Can you try removing
metric_logger.update(loss=losses_reduced, **loss_dict_reduced)
It seems to be a tensor that carries the whole graph history.
If you want to log the loss, log loss_value = losses_reduced.item() instead, which is a Python float.
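
A minimal sketch of the difference, assuming a toy parameter w and two hypothetical lists (history_tensors, history_floats) that stand in for whatever the logger stores. It only illustrates the general mechanism; it does not show what metric_logger does internally.

```python
import torch

# Toy parameter standing in for the model weights (illustrative only).
w = torch.ones(3, requires_grad=True)

history_tensors = []  # hypothetical log that stores the loss tensors themselves
history_floats = []   # hypothetical log that stores plain Python floats

for step in range(3):
    loss = (w * float(step + 1)).sum()

    # Storing the tensor keeps its grad_fn alive, and with it the
    # autograd graph of that iteration.
    history_tensors.append(loss)

    # .item() copies the value into a Python float with no link to the graph.
    history_floats.append(loss.item())

print(history_tensors[0].grad_fn is not None)  # the stored tensor still references its graph
print(history_floats)
```

If the value is needed as a tensor rather than a float, loss.detach() also returns a tensor without a graph reference.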

Fanny · July 23, 2021, 6:48am

Thanks for the answer, I tried but it didn’t change anything.

JuanFMontesinos · July 23, 2021, 8:06pm

Then it’s difficult to know.
I would recommend stripping everything down to the basics and then adding pieces back one at a time until you find the cause.
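
To make that search less of a guessing game, it may help to record the process memory after every iteration while pieces are added back. Below is a rough sketch using only the standard library (Unix only). The names peak_rss_mb, track_growth and leaky_step are made up for illustration, and leaky_step is a deliberately leaking stand-in for one training iteration, not the tutorial's code. Note that ru_maxrss reports the peak resident size, which is enough to spot steady growth but will not show memory being released.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def track_growth(step_fn, n_steps):
    """Call step_fn(i) n_steps times and record peak RSS after each call."""
    readings = []
    for i in range(n_steps):
        step_fn(i)
        readings.append(peak_rss_mb())
    return readings


# Hypothetical stand-in for one training iteration that leaks on purpose.
leaked = []


def leaky_step(i):
    leaked.append(b"x" * (5 * 1024 * 1024))  # hold on to 5 MiB per step


readings = track_growth(leaky_step, 5)
for i, mb in enumerate(readings):
    print(f"step {i}: peak RSS {mb:.1f} MiB")
```

Replacing leaky_step with a function that runs a single real training iteration, and re-enabling one component at a time, should show which component makes the readings climb.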

Fanny · July 26, 2021, 1:35pm

Thanks.
As I said, when I comment out the line losses.backward(), RAM usage stays constant over time.