Latest PyTorch, CUDA 9 and cuDNN 7, installed with conda on Linux (Lubuntu 16.04, latest NVIDIA driver).
I am running a pose estimation training script on a computer with 2x GTX 1080 Ti. Previously, I was running the imagenet example from the pytorch examples.
For both of them, the GPU memory needed for training seems high compared to the same kind of training with Caffe or TensorFlow, and I have not found a way to keep the scripts from being greedy with memory.
For instance, with the pose estimation script there is a burst of GPU memory usage either at the end of an epoch or at the beginning of the evaluation step, which leads to a CUDA error (out of memory). With 11 GB of memory on each GPU, I can only run the training with batch size = 32, even though the script uses only about half of the available memory within an epoch.
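In case it helps to pin down where the allocation jumps, here is a minimal sketch of what I plan to add around the epoch/evaluation boundary: log the per-GPU memory counters and run the evaluation without building the autograd graph. The model, data and loop below are toy stand-ins, not the example's real code, and I'm assuming `torch.cuda.memory_allocated` / `max_memory_allocated` and `torch.no_grad()` exist in this PyTorch version (older releases used `volatile=True` Variables instead of `no_grad`):

```python
import torch
import torch.nn as nn

def log_gpu_memory(tag):
    # Assumed to be available in this PyTorch version.
    for dev in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(dev) / 1024 ** 2
        peak = torch.cuda.max_memory_allocated(dev) / 1024 ** 2
        print('[{}] GPU {}: {:.0f} MiB allocated, {:.0f} MiB peak'.format(
            tag, dev, alloc, peak))

# Toy stand-ins for the real model and data, just to show where the calls go.
model = nn.Linear(1024, 10).cuda()
inputs = torch.randn(32, 1024).cuda()
targets = torch.randint(0, 10, (32,)).long().cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# "Training" step: the graph is built and freed by backward().
model.train()
loss = criterion(model(inputs), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
log_gpu_memory('after train step')

# Evaluation step wrapped in no_grad, so activations are not kept for backward,
# which is one of the things I suspect contributes to the burst at evaluation time.
model.eval()
with torch.no_grad():
    _ = model(inputs)
log_gpu_memory('after eval step')
```

Putting these prints around the real training and evaluation loops should at least show on which step the allocation jumps.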
The imagenet example shows the same problem: I had to limit the batch size to 80 to be able to train a vgg19_bn. There is another issue when restarting from a checkpoint: the script dies at the end of each epoch with a CUDA error (out of memory), even with the same batch size.
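For the resume case, this is roughly the pattern I am using, modeled on the imagenet example; I wonder whether loading the checkpoint straight onto the GPU keeps a second copy of the weights alive until the end of the epoch. This is only a sketch: the file name and the dictionary keys (`'state_dict'`, `'optimizer'`, `'epoch'`) are what I believe the example uses, but treat them as placeholders.

```python
import torch
import torchvision.models as models

model = models.vgg19_bn().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Map all storages to CPU so the checkpoint tensors stay in host memory;
# load_state_dict then copies them into the existing GPU parameters, so no
# second full copy of the weights should remain allocated on the GPU.
checkpoint = torch.load('checkpoint.pth.tar',
                        map_location=lambda storage, loc: storage)
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])
start_epoch = checkpoint['epoch']
del checkpoint  # drop the host-side copy once the state has been restored
```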
Can anyone help me understand these bursts of GPU memory usage?