How to explain huge GPU RAM usage?


Latest PyTorch, CUDA 9 and cuDNN 7 installed using conda on Linux (Lubuntu 16.04, latest NVIDIA driver).

I am running this code on a computer with 2x GTX 1080 Ti:

Previously, I was running the ImageNet example from the PyTorch examples.

For both of them, the GPU RAM needed to run the training seems high compared with the same kind of training using Caffe or TensorFlow (with TensorFlow prevented from being greedy on the RAM).
For instance, in the pose estimation code, there is a burst of RAM usage either at the end of an epoch or at the beginning of the evaluation step, leading to a CUDA out-of-memory error. With 11 GB of RAM on each GPU, I can run the training only with batch size = 32. Within an epoch, the script uses only half of the available RAM.

For the ImageNet example, I have the same problem: I had to fix batch_size to 80 to be able to train a VGG19_bn. Another problem: when restarting from a checkpoint, the script dies at the end of each epoch with a CUDA out-of-memory error, even with the same batch size.

Can anyone help me understand this burst of RAM?

Thank you.

If it can help, here are the package versions from my conda environment:

Name Version Build Channel

My guesses are:

  • You don’t evaluate inside a `with torch.no_grad():` block.
  • You are storing the loss tensors instead of their values (that is, you are not calling the tensor's `item()` method), which keeps the whole computation graph in memory.
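
To make those two guesses concrete, here is a minimal sketch of what I mean. The function names (`train_one_epoch`, `evaluate`) and the loop structure are illustrative assumptions, not taken from your script, so adapt them to your own code:

```python
import torch
import torch.nn as nn


def train_one_epoch(model, loader, criterion, optimizer, device):
    """Hypothetical training loop that accumulates the loss as a Python float."""
    model.train()
    running_loss = 0.0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        # .item() copies the scalar out of the tensor, so the graph attached
        # to `loss` can be freed. `running_loss += loss` would keep every
        # iteration's graph alive until the end of the epoch.
        running_loss += loss.item()
    return running_loss / len(loader)


def evaluate(model, loader, criterion, device):
    """Hypothetical evaluation loop run without autograd bookkeeping."""
    model.eval()
    total_loss = 0.0
    # Inside no_grad() no intermediate activations are stored for backward,
    # which usually lowers peak memory during evaluation.
    with torch.no_grad():
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            total_loss += criterion(model(inputs), targets).item()
    return total_loss / len(loader)
```

Note that `torch.no_grad()` only disables gradient tracking; on its own it should not change the evaluation numbers.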

Check those two first and see if that’s the problem. At this moment I don’t have time to check out your code.

Thank you for your answer.

I already tested the loss hypothesis. I have not tested `with torch.no_grad()` yet. The training is currently running (but it will fail, as the memory keeps filling up…).

If someone has another hypothesis about this problem, they are welcome to submit it before the next training run.

Thank you.

Sorry for my late answer. So, to the point:

  • I did not manage to make it work with `torch.no_grad()`. And actually, I want to compare with other runs, so I dropped this idea.
  • I tried adding some `del` statements on tensors (as seen in the PyTorch documentation), but nothing changed. The surprising thing is that PyTorch starts with half of the GPU memory and ends up filling it after several epochs, yet I get no CUDA errors during training (at the end, the memory of both GPUs is almost full). I still do not understand what is happening… Is there any way to track all memory blocks on the GPU and where they were allocated?

Thank you.

PS: anyway, my run went fine; it is just a pity that with 2x 1080 Ti GPUs, I cannot run with a bigger batch size…
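
On the question of tracking GPU memory: memory that slowly fills up without ever raising an error is consistent with PyTorch's caching allocator, which keeps freed blocks reserved for reuse instead of returning them to the driver, so tools like nvidia-smi still count them as used. A rough sketch for telling the two apart is below. The helper names `report_cuda_memory` and `live_tensors` are hypothetical, `torch.cuda.memory_reserved` assumes a reasonably recent PyTorch (older releases called it `memory_cached`), and the `gc` scan only lists tensors that are still referenced from Python; it does not say where they were allocated.

```python
import gc

import torch


def report_cuda_memory(tag, device=0):
    """Print and return (allocated, reserved) in MiB; None without CUDA."""
    if not torch.cuda.is_available():
        print(f"[{tag}] CUDA is not available")
        return None
    # Memory occupied by live tensors.
    allocated = torch.cuda.memory_allocated(device) / 2**20
    # Memory held by the caching allocator, including freed-but-cached blocks.
    reserved = torch.cuda.memory_reserved(device) / 2**20
    print(f"[{tag}] allocated={allocated:.1f} MiB reserved={reserved:.1f} MiB")
    return allocated, reserved


def live_tensors(device_type="cuda"):
    """List (shape, dtype, bytes) of tensors the garbage collector can see."""
    found = []
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.device.type == device_type:
                nbytes = obj.element_size() * obj.nelement()
                found.append((tuple(obj.shape), obj.dtype, nbytes))
        except Exception:
            # Some objects raise on attribute access; skip them.
            continue
    # Largest tensors first.
    return sorted(found, key=lambda entry: entry[2], reverse=True)
```

Calling `report_cuda_memory("end of epoch")` once per epoch would show whether the allocated figure itself keeps growing (something is still holding references) or only the reserved figure does (the cache). `torch.cuda.empty_cache()` releases unused cached blocks back to the driver, but it does not free memory that live tensors still occupy.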