Monte Carlo Sampling: CUDA Out of memory

Hi!
I am trying to perform Monte Carlo sampling to estimate the uncertainty of my network. To do so, I have dropout enabled and I am executing the following loop:

outputs = [self._model(inputs) for i in range(X)]

Where “inputs” is just a single frame.
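For context, a runnable stand-in for what I am doing looks roughly like this (the small nn.Sequential model and the way I keep dropout active are just placeholders for my actual network):

import torch
import torch.nn as nn

# stand-in for self._model: any network that contains dropout layers
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 10))
inputs = torch.randn(1, 128)  # a single frame

model.eval()
# keep only the dropout layers in train mode so every pass is a different stochastic sample
for m in model.modules():
    if isinstance(m, nn.Dropout):
        m.train()

X = 40
outputs = [model(inputs) for i in range(X)]
uncertainty = torch.stack(outputs).std(dim=0)  # per-output standard deviation across the samples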
First I tried this code in Google Colab, and everything worked fine, even with X = 40 and batch_size = 100 (frames) at once.
When I switch to my personal computer, which has a GPU with 6 GB of dedicated memory, I get the famous “CUDA out of memory” error as soon as X is greater than 5 (with a single frame). Even if I split the code like this:

outputs = [self._model(inputs) for i in range(X // 2)]
outputs = [self._model(inputs) for i in range(X // 2)]

The code gives me the same result.
Why is this happening? Is the for loop being parallelized by the CUDA API, thus exhausting the memory? How can I solve this?

Thank you in advance

Hello,

I don’t know if this will solve your issue, but I was having a similar one (I am looping a model over different folds), and on the second loop I got this famous memory error.
At the end of the loop I free the memory in the following way:

# delete every global variable whose name does not start with "__",
# then re-import torch and release the cached GPU memory
for element in dir():
    if element[0:2] != "__":
        del globals()[element]

import torch
torch.cuda.empty_cache()

I found this solution on Kite
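If deleting everything is too aggressive, a gentler variant of the same idea is to delete only the large objects from the current fold and then empty the cache (the variable names here are just placeholders):

import gc
import torch

model = torch.nn.Linear(10, 10).cuda()        # placeholder for the model trained in this fold
outputs = model(torch.randn(256, 10).cuda())  # placeholder for this fold's outputs

del outputs, model         # drop the references to the large objects only
gc.collect()               # let Python actually release them
torch.cuda.empty_cache()   # return the cached blocks to the GPU driver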

I hope it solves your problem

First of all, thank you for your response.
I have tried your approach. However, I still need many of the variables in my script, so I cannot delete them all; the program would stop working as a consequence. If I only use “torch.cuda.empty_cache()”, I still get “CUDA out of memory”, so apparently that alone does not solve it.
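As far as I can tell, torch.cuda.empty_cache() only releases cached blocks that are no longer referenced, so it does not help with tensors my script is still holding on to. A toy check (not my actual model) shows the allocated memory does not change:

import torch

x = torch.randn(1024, 1024, device="cuda")  # a tensor the script still references
print(torch.cuda.memory_allocated())        # memory held by live tensors
torch.cuda.empty_cache()                    # only frees cached, unreferenced blocks
print(torch.cuda.memory_allocated())        # unchanged, because x is still alive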

EDIT:
This crashes:

        outputs = self._model(inputs)
        outputs2 = self._model(inputs)
        outputs3 = self._model(inputs)
        outputs4 = self._model(inputs)
        outputs5 = self._model(inputs)
        outputs6 = self._model(inputs)
        outputs7 = self._model(inputs)
        outputs8 = self._model(inputs)
        outputs9 = self._model(inputs)

While this does not:

        outputs = self._model(inputs)
        outputs = self._model(inputs)
        outputs = self._model(inputs)
        outputs = self._model(inputs)
        outputs = self._model(inputs)
        outputs = self._model(inputs)
        outputs = self._model(inputs)
        outputs = self._model(inputs)
        outputs = self._model(inputs)

So it is clearly about the memory needed to hold the output tensors on the GPU. I have tried copying the outputs into a CPU tensor before computing more samples, but it is still not working.
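Concretely, what I tried looks roughly like this (copying each sample to the CPU right after the forward pass is my reading of “replicate in a CPU tensor”):

# copy each Monte Carlo sample to the CPU as soon as it is computed
outputs = [self._model(inputs).cpu() for i in range(X)]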