Fanny
July 22, 2021, 10:32am
1
Hi everyone,
I’m trying to train an instance segmentation model on GPU, based on this PyTorch tutorial: https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html
However, after a few epochs (3 with a batch_size of 1, 6 with a batch_size of 2), all of the available RAM is used up and training stops.
How can this be fixed?
I have a GPU with 16 GB of memory, my PC has 32 GB of RAM, and I'm training the model on ~5000 images.
The rise in RAM usage seems to happen during losses.backward() in the train_one_epoch function (available here).
Thanks.
Can you try removing this line?
metric_logger.update(loss=losses_reduced, **loss_dict_reduced)
It looks like it is passed a tensor that carries the whole graph history.
If you want to log the loss, log this one instead: loss_value = losses_reduced.item()
which is a Python float.
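To illustrate the general concern, here is a minimal sketch of the difference between a loss tensor and its .item() value. The model and data are hypothetical stand-ins, not the tutorial's code, and this does not show what metric_logger does internally.

```python
import torch

# Hypothetical tiny model and batch, only for illustration.
model = torch.nn.Linear(4, 1)
x = torch.randn(8, 4)
y = torch.randn(8, 1)

loss = torch.nn.functional.mse_loss(model(x), y)

# The loss tensor still references the autograd graph through grad_fn,
# so storing it in a list or logger keeps that graph alive.
has_graph = loss.grad_fn is not None

# .item() copies the value out as a plain Python float, with no graph attached.
loss_value = loss.item()

# .detach() returns a tensor that shares the data but has no grad_fn.
detached = loss.detach()
```

Storing loss_value (or detached) across iterations keeps only the number, whereas storing loss itself would keep every iteration's graph reachable.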
Fanny
July 23, 2021, 6:48am
3
Thanks for the answer, I tried but it didn’t change anything.
In that case it’s hard to tell.
I would recommend stripping the training loop down to the basics and then adding pieces back one at a time until the memory growth reappears.
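One possible way to carry out that kind of bisection is to measure memory growth around each candidate piece of the loop. The sketch below uses the standard-library tracemalloc module; measure_growth, leaky_step and clean_step are hypothetical names for illustration. Note that tracemalloc only sees Python-level allocations, so memory allocated natively by PyTorch would need a process-level measurement instead.

```python
import tracemalloc

def measure_growth(step, n_iters=100):
    """Return the change in traced Python memory (bytes) after n_iters calls to step."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    for _ in range(n_iters):
        step()
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before

history = []

def leaky_step():
    # Keeps a reference to every buffer, so memory grows with each call.
    history.append(bytearray(10_000))

def clean_step():
    # The buffer is released when the function returns.
    _ = bytearray(10_000)

leaky_growth = measure_growth(leaky_step)
clean_growth = measure_growth(clean_step)
print(f"leaky: {leaky_growth} bytes, clean: {clean_growth} bytes")
```

Wrapping one suspect piece of the training step at a time in such a measurement can help narrow down which one accumulates memory.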
Fanny
July 26, 2021, 1:35pm
5
Thanks.
As I said, when I comment out the line losses.backward(), the RAM usage stays constant over time.