Asking for more than 128G of memory

Hi all,

I trained a 3D U-Net with 16 base filters that is 5 layers deep. Now I am trying to run inference with it on a 240x240x155 volume on a CPU. I have allocated 128 GB of RAM, but it still fails with this error:

RuntimeError: $ Torch: not enough memory: you tried to allocate 0GB. Buy new RAM! at /opt/conda/conda-bld/pytorch

I do not have the money to buy more RAM. The model should require at most 32 GB of RAM for that image.
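
A rough back-of-the-envelope check of that figure can be sketched in plain Python. This is only an estimate under assumptions that are not stated in this thread: float32 activations, "5 layers deep" meaning five resolution levels, channels doubling and each spatial dimension halving per level, and a hypothetical six feature maps held in memory per level. The function names are made up for illustration.

```python
def feature_map_bytes(shape, channels, bytes_per_value=4):
    """Bytes needed for one feature map of the given spatial shape (float32 by default)."""
    d, h, w = shape
    return d * h * w * channels * bytes_per_value


def estimate_unet_activation_bytes(shape=(240, 240, 155), base_filters=16,
                                   levels=5, maps_per_level=6):
    """Sum the activation memory over all levels of an assumed U-Net layout.

    Assumptions (not confirmed by the original post): channels double and each
    spatial dimension is halved (floor division) at every level, and
    `maps_per_level` feature maps are alive per level.
    """
    total = 0
    for level in range(levels):
        level_shape = tuple(s // (2 ** level) for s in shape)
        channels = base_filters * 2 ** level
        total += maps_per_level * feature_map_bytes(level_shape, channels)
    return total


if __name__ == "__main__":
    total = estimate_unet_activation_bytes()
    print(f"estimated activations: {total / 1024 ** 3:.2f} GiB")
```

Under these assumptions the activations come to a few GiB, so the 32 GB upper bound looks plausible, and a request for more than 128 GB suggests that something other than the plain forward activations is using the memory.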

Can I know where I may be going wrong?


Can you post the code for the model? Also did this occur when executing a TorchScript function/module or a normal nn.Module?

I cannot post the model since it is ongoing work. But I can confirm that I trained this model on a 16 GB GPU.

As far as I understand your issue, the training script takes 16GB at most running on the GPU and more than 128GB on the CPU?
If that’s correct, do you see memory usage increasing during training, or does your script run out of memory during the first iteration?
Did you change something in your data loading pipeline, e.g. are you loading the complete dataset into RAM?

Oh, I was not even talking about training; this is the cost of inference on a single example.
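
For reference, one thing worth ruling out for single-example inference is that autograd is still enabled, since the forward pass then keeps intermediate activations around for a backward pass that never happens. Whether that is the cause here is not confirmed in this thread. A minimal sketch, using a tiny hypothetical stand-in model and a small dummy volume because the real 3D U-Net was not posted:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real 3D U-Net, which is not shown in this thread.
model = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv3d(16, 1, kernel_size=1),
)
model.eval()  # inference behaviour for layers such as dropout and batch norm

# Small dummy input (batch, channel, D, H, W); the real volume would be 240x240x155.
volume = torch.randn(1, 1, 24, 24, 16)

with torch.no_grad():  # do not build the autograd graph during the forward pass
    prediction = model(volume)

print(prediction.shape, prediction.requires_grad)
```

If memory is still exhausted with autograd disabled, the next thing to check would be where in the forward pass the allocation fails.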