Moving from GPU to CPU not freeing GPURAM

I’m not sure whether this is a bug or I’m not understanding how PyTorch works, which is why I’m posting here first. I’m using PyTorch 1.5.0.

My model runs on the GPU. I move each batch to the device at the beginning of the loop, then forward it through the model. When I then move the output back to the CPU, however, the GPU memory doesn’t seem to be freed. On the next iteration the memory still isn’t freed, and I end up with an out-of-memory error after a few loops.

mem_a = torch.cuda.memory_summary(device, True)
b = b.to(device)
mem_a2 = torch.cuda.memory_summary(device, True)
forward = model.forward(b)
mem_a21 = torch.cuda.memory_summary(device, True)
forward = forward.to("cpu")
mem_a3 = torch.cuda.memory_summary(device, True)

I’ve checked in the debugger, and forward after the .to() call reports that it is on the CPU. However, look at the GPU RAM measured at each of these points:

Loop 1:
mem_a = 238MB
mem_a2 = 778MB
mem_a21 = 3471MB
mem_a3 = 3471MB

Loop 2:
mem_a = 3471MB
mem_a2 = 3781MB
mem_a21 = 5382MB
mem_a3 = 5382MB

Loop 3:
mem_a = 5382MB
mem_a2 = 5693MB
Out of Memory

To me, it very much looks like PyTorch isn’t releasing the GPU memory after the move to CPU. I’ve tried this in a number of ways; even assigning the CPU copy to a separate variable and deleting the GPU one doesn’t work, as if a reference is kept somewhere. If I delete the tensor while it’s still on the GPU, the memory usage disappears, so it definitely seems to be something around this .to() call that is keeping a reference to the GPU.

I need to store all of this output before I can do the next step, and I have more than enough system RAM for it if I can properly get it out of the GPU, but I can’t seem to do that.

Am I doing something wrong here?

Can you try this one out:

torch.cuda.empty_cache()
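For context, empty_cache() only returns the caching allocator’s *unused* blocks to the driver; memory held by tensors that are still referenced can never be freed by it. A rough sketch of the distinction (guarded so it also runs on a machine without a GPU; the tensor size here is arbitrary):

```python
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    t = torch.randn(1024, 1024, device=device)
    before = torch.cuda.memory_allocated(device)  # bytes held by live tensors
    del t                     # drop the last reference first...
    torch.cuda.empty_cache()  # ...only then can the cached block be released
    after = torch.cuda.memory_allocated(device)
else:
    # No GPU in this environment; nothing is allocated on a device.
    before, after = 0, 0
```

So if a reference to the GPU tensor survives somewhere, empty_cache() alone won’t show any change in the summary.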

I’ve tried adding it as below, but there has been no difference. I also tried gc.collect() just to see if a variable was being held in memory, with no difference.

mem_a21 = torch.cuda.memory_summary(device, True)
forward = forward.to("cpu")
torch.cuda.empty_cache()
mem_a3 = torch.cuda.memory_summary(device, True)

I’m really unsure where to debug from this point… I’ve also tried forward.cpu() and passing torch.device("cpu"), so I don’t think it’s specifically an issue with .to().

Are you collecting results, e.g. losses, and storing them in a list or something similar? If so, make sure you store graph-detached copies of the relevant tensors by calling Tensor.detach():

losses = []

# Inside training loop or similar
loss = criterion(...)
losses.append(loss.detach())  # drop the graph reference before storing

If you store tensors attached to a graph, invoking garbage collection will have no impact since those tensors are still actively referenced via the list.
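To make the difference concrete, here is a small self-contained sketch (the model and loss are stand-ins, not the original poster’s code). A tensor stored without detach() still carries a grad_fn, i.e. a reference into the autograd graph, which in turn keeps all of the graph’s intermediate activations alive:

```python
import torch

# Hypothetical minimal training-style loop to illustrate the point.
model = torch.nn.Linear(10, 1)
criterion = torch.nn.MSELoss()

attached, detached = [], []
for _ in range(3):
    x = torch.randn(4, 10)
    loss = criterion(model(x), torch.zeros(4, 1))
    attached.append(loss)           # keeps the whole autograd graph alive
    detached.append(loss.detach())  # plain tensor; the graph can be freed

print(attached[0].grad_fn is not None)  # True: still tied to the graph
print(detached[0].grad_fn is None)      # True: graph reference dropped
```

On a GPU, those intermediate activations live in GPU memory, so a list of attached losses grows GPU usage every iteration even though each stored tensor looks tiny.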

Hi everybody,
any updates for this issue?
It seems that I came into the same situation.

Best,
Jianxiang

In my case, it turned out that the now-marked answer (which I forgot to mark as the solution back then) was indeed the issue and the fix.

Thank you for the quick reply!
Since I ran into this during inference, where no gradients are required, I found that others suggested wrapping the mini-batch inference in with torch.no_grad():, and that worked fine for me.
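For anyone landing here later, a minimal sketch of that pattern (the linear model and random batches are placeholders for the real model and data; on a real GPU you would also call batch.to(device)):

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in for the real model
model.eval()
outputs = []

with torch.no_grad():  # no autograd graph is built, so nothing pins GPU memory
    for _ in range(3):
        batch = torch.randn(4, 10)
        out = model(batch)
        outputs.append(out.cpu())  # move each result off the GPU as you go

print(all(o.grad_fn is None for o in outputs))  # True: no graph was recorded
```

Because no graph is recorded under no_grad(), there is nothing for the stored outputs to hold onto, and the GPU memory is freed as each batch’s tensors go out of scope.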