How to clear some GPU memory?

Hello,

I put some data on a GPU using PyTorch and now I’m trying to take it off without killing my Python process. How can I do this?

Here was my attempt:

import torch
import numpy as np


n = 2**14
a_2GB = np.ones((n, n))  # RAM: +2GB
del a_2GB  # RAM: -2GB
a_2GB = np.ones((n, n))  # RAM: +2GB
a_2GB_torch = torch.from_numpy(a_2GB)  # RAM: Same
a_2GB_torch_gpu = a_2GB_torch.cuda()  # RAM: +0.9GB, VRAM: +2313MiB
del a_2GB  # RAM: Same, VRAM: Same
del a_2GB_torch_gpu  # RAM: Same, VRAM: Same
del a_2GB_torch  # RAM: -2GB, VRAM: Same

Even though nvidia-smi shows PyTorch still holding 2GB of GPU memory, that memory can be reused by the same process if needed.

After del try:

a_2GB_torch_gpu_2 = a_2GB_torch.cuda()
a_2GB_torch_gpu_3 = a_2GB_torch.cuda()  

and you will see that the memory reported by nvidia-smi does not grow by another 2GB for each new tensor.


Even if that same process can reuse the GPU memory, it doesn’t look like other processes can. I’m running into a similar utilization concern.

Another process will run into Out of Memory errors, while the original process keeps the GPU memory even after it is done using it.


That’s right. When there are multiple processes on one GPU that each use a PyTorch-style caching allocator there are corner cases where you can hit OOMs, but it’s very unlikely if all processes are allocating memory frequently (it happens when one proc’s cache is sitting on a bunch of unused memory and another is trying to malloc but doesn’t have anything left in its cache to free; if the first one were allocating at all it would hit the limit and know to free its cache). It could be improved, but it’s a lot better than frameworks that commandeer your whole GPU even if they’re only using 100MB…


I have run into a related issue while using the experimental Windows version. In my training phase, CUDA allocates about 4GB for mini-batches while I optimize my parameters. Then, when I am done and want to predict on a separate dataset using the same mini-batch size, a fresh 4GB is allocated.

To be more precise, when I am done training and nothing but the model should remain on the GPU, I can set a breakpoint and issue these commands (all memory readings come from nvidia-smi):
T = torch.rand(1000, 1000000).cuda()  # Now memory reads 8GB (i.e. a further 4GB was allocated, so the training 4GB was NOT considered 'free' by the caching allocator, even though it was being reused during training)
del T  # Still 8GB (as expected)
T = torch.rand(1000, 1000000).cuda()  # Still 8GB as expected; the caching allocator is reusing the same space as the first T above

So it looks like the 4GB from training are still taking up space on the GPU, even though they should be freed. But later they are being reused (when retraining the same model). I.e. they can be reused for the same purpose but not for arbitrary tensors - which makes no sense to me, of course.

Is there a way to manually force the caching allocator to free some GPU memory? Or, since it seems that the caching allocator doesn’t think the space is actually free, can I move my model to the CPU with model.cpu() and then ask torch to free everything it has on the GPU?


For those who are facing a similar memory issue, look at the autograd setting volatile.
It is recommended for inference, since it reduces the amount of memory used when evaluating the model.
http://pytorch.org/docs/master/notes/autograd.html#volatile

You could do something like this:
volatile_input = Variable(torch.randn(1000,1000000), volatile=True).cuda()
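For readers on PyTorch 0.4 or later: the volatile flag was removed there, and torch.no_grad() serves the same purpose. A minimal sketch, assuming a stand-in nn.Linear model and input sizes that are not from this thread:

```python
import torch
import torch.nn as nn

# Fall back to the CPU so the sketch also runs on machines without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-in model; replace with your own network.
model = nn.Linear(1000, 10).to(device)
model.eval()

inputs = torch.randn(64, 1000, device=device)

# Inside no_grad() autograd records no graph, so intermediate
# activations are not kept around for a backward pass.
with torch.no_grad():
    outputs = model(inputs)

print(outputs.requires_grad)  # False
```

The outputs carry no autograd history, so nothing from the forward pass stays alive beyond the tensors you keep yourself.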


Good call, thanks. We already set our input variables in predict() to volatile=True. My impression is that GPU memory left committed from training is being ‘hoarded’, and it is that memory that I would like to clear / free / repurpose. (I also tried setting volatile=False on all my variables in the predict method, but that didn’t fix the memory ‘leak’.)

It is not a memory leak. In the newest PyTorch, you can use torch.cuda.empty_cache() to release the cached memory.
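A rough sketch of how this can be checked from inside the process, assuming a PyTorch version that provides torch.cuda.memory_reserved() (older releases called it memory_cached()). The helper names cuda_memory_mib and release_cached_memory are made up for illustration:

```python
import gc
import torch


def cuda_memory_mib(device=None):
    """Return (allocated, reserved) in MiB, or (0.0, 0.0) without CUDA."""
    if not torch.cuda.is_available():
        return 0.0, 0.0
    mib = 1024 ** 2
    return (torch.cuda.memory_allocated(device) / mib,
            torch.cuda.memory_reserved(device) / mib)


def release_cached_memory():
    """Collect unreachable Python objects, then hand unused cached blocks back to the driver."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


if torch.cuda.is_available():
    x = torch.ones(1024, 1024, device="cuda")   # about 4 MiB of float32
    print("after alloc:      ", cuda_memory_mib())
    del x                                        # block returns to PyTorch's cache
    print("after del:        ", cuda_memory_mib())
    release_cached_memory()                      # cache is released to the driver
    print("after empty_cache:", cuda_memory_mib())
```

"Allocated" counts memory held by live tensors; "reserved" also includes what the caching allocator is holding for reuse, which is closer to the number nvidia-smi reports. Only the reserved-but-unallocated part can be released by empty_cache().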


I have the same problem as MatthewKleinsmith.
I set volatile=False and called torch.cuda.empty_cache(), but it still does not work.

If you already removed unwanted references to the Variables, empty_cache should definitely work. You can check by seeing the nvidia-smi values.

volatile=False is the default option. It will build the graph as it goes. Setting it makes no difference.

If you already removed unwanted references to the Variables, empty_cache should definitely work

@SimonW Could you elaborate on what you mean by removing unwanted references? Do you mean explicitly deleting variables (del variable) or something else?

After every epoch I’m calling torch.cuda.empty_cache(), but nvidia-smi still shows an increase in GPU memory after every loop.

    for epoch in range(20):
        for batch in train_data.batches:
            inputs, targets = batch
            predictions = model(inputs)
            predictions = predictions.view(-1, model.vocab_size, model.batch_size)
            targets = targets.view(-1, model.batch_size)
            loss = loss_function(predictions, targets)
            model.zero_grad()
            model.init_hidden()
            loss.backward()
            optimizer.step()
        tester.test(model)
        tester.print_samples()
        torch.cuda.empty_cache()

torch.cuda.empty_cache doesn’t give PyTorch extra GPU memory to use. See http://pytorch.org/docs/master/notes/cuda.html#memory-management. So it won’t help if you are trying to solve an OOM when PyTorch is the only thing using that GPU.

The structure of your code segment looks fine. So it’s probably one of loss_function, init_hidden, tester.test, or tester.print_samples that’s causing the issue.
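One way to narrow this down is to list the tensors that Python's garbage collector can still reach between epochs and see which ones keep accumulating. This is only a debugging sketch; live_tensors is a hypothetical helper, not a PyTorch API:

```python
import gc
import torch


def live_tensors(device_type=None):
    """List (type name, shape, device type) for tensors the garbage collector can see."""
    found = []
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and (
                device_type is None or obj.device.type == device_type
            ):
                found.append((type(obj).__name__, tuple(obj.shape), obj.device.type))
        except Exception:
            # Some objects raise on attribute access; skip them.
            continue
    return found


keep = torch.ones(3, 7)          # still referenced, so it shows up below
for entry in live_tensors():
    print(entry)
```

Calling live_tensors("cuda") at the end of every epoch and comparing the counts can point to the function that is holding on to old tensors.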

I realise where I was making a mistake. My model has an LSTM and I’m supposed to pass on a new, empty variable as the hidden state. If I pass on an existing variable, such as the hidden state from the previous timestep, the model backprops all the way back to the first epoch on every epoch of training. This is precisely why the GPU memory kept exploding after every epoch. I now have something like this, and it works fine.

def zero_hidden():
    return (torch.zeros(1, 1, hidden_dim),
            torch.zeros(1, 1, hidden_dim))

lstm_out, lstm_hidden = lstm(lstm_in, zero_hidden())
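If the hidden state should carry over between batches instead of being reset to zeros, a common alternative is to detach it so its values are kept but its autograd history is dropped. A minimal sketch with made-up sizes; detach_hidden is a hypothetical helper:

```python
import torch
import torch.nn as nn

hidden_dim = 16  # assumed size, for illustration only
lstm = nn.LSTM(input_size=8, hidden_size=hidden_dim)


def detach_hidden(hidden):
    """Keep the hidden state's values but cut the link to earlier graphs."""
    return tuple(h.detach() for h in hidden)


hidden = (torch.zeros(1, 1, hidden_dim), torch.zeros(1, 1, hidden_dim))

for step in range(3):
    lstm_in = torch.randn(5, 1, 8)        # (seq_len, batch, input_size)
    hidden = detach_hidden(hidden)        # backprop stops at this batch boundary
    lstm_out, hidden = lstm(lstm_in, hidden)
    lstm_out.sum().backward()             # only the current batch's graph is traversed
```

Without the detach call, the second backward() would try to walk back into the first batch's graph, which is the same growth described above.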

I don’t know how to use it lol

I have a similar problem to @nikhilweee. I have tried to clean garbage with calls to both torch.cuda.empty_cache() and torch.cuda.ipc_collect(). This works great in my CNN training loop. Reserved memory stays constant for any number of batches. But in the validation loop, the memory use climbs after each iteration. I found that the difference is the loss.backward() statement in the training loop is cleaning out the garbage somehow. Since this isn’t in validation, it just keeps piling up in spite of the calls to torch.cuda.ipc_collect().

Anybody know what’s going on in the backward method?
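A possible explanation, offered as an assumption since the validation code is not shown: by default backward() frees the graph's saved intermediate buffers once it has run, whereas a forward pass that is never followed by backward() keeps its graph alive for as long as anything references the output (for example a running sum of loss tensors). A sketch of a validation loop that avoids this, using a hypothetical evaluate helper and toy data:

```python
import torch
import torch.nn as nn


def evaluate(model, batches, loss_function):
    """Average the loss over batches without retaining autograd graphs."""
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():                  # no graph is built in validation
        for inputs, targets in batches:
            loss = loss_function(model(inputs), targets)
            total += loss.item()           # plain Python float, no tensor reference kept
            count += 1
    return total / max(count, 1)


# Toy model and data, only to exercise the function.
model = nn.Linear(8, 1)
batches = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(3)]
val_loss = evaluate(model, batches, nn.MSELoss())
print(val_loss)
```

Accumulating loss.item() rather than the loss tensor itself matters even in training loops, because a summed tensor keeps every contributing graph reachable.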

Hi, torch.cuda.empty_cache() really helps. I recently came across this problem too. Normally my task needs 1GB of GPU memory, but usage steadily went up to 5GB. If torch.cuda.empty_cache() was not called, GPU memory usage stayed at 5GB. After calling this function, usage decreased to 1-2GB.

I am training an RL project with PyTorch 0.4.1. I am still confused and cannot find the reason. I used TF before and did not see this issue there.

I was able to free probably all the GPU memory used by tensors by using the following sequence:

  • model.to('cpu')  # moves the model to where you probably have more memory
  • model_RAM_copy = model.state_dict()
  • delete_tensors()  # this function goes through the model's children and deletes .weight, like del model.layer1.weight; I do not care about .bias, they are much smaller
  • torch.cuda.empty_cache()
  • model.to('cuda:0')
  • model = newModel().to('cuda:0')  # recreating the model structure from scratch
  • criterion = …, optimizer = …, lr_scheduler = …
  • torch.cuda.empty_cache()
  • model.load_state_dict(model_RAM_copy)

In fact, my code is a little longer, and I think there is some redundancy; I just didn’t take the time to optimize it. Anyway, it works in terms of memory usage (I can see a beautiful line of GPU memory usage going up and down within some stable limits). I am sure this can be done more smoothly, but I was looking for a solution for a few days and ended up with this. It does not look pretty, but it works. I am not sure about backpropagation correctness or about preserving learning rate statistics. The network learns pretty well, but it is possible that I am losing some parameters along the way.
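A shorter variant of the same idea, sketched under assumptions: make_model is a hypothetical factory standing in for newModel(), and the weights are cloned to the CPU explicitly so the copy does not share storage with the model being deleted.

```python
import gc
import torch
import torch.nn as nn


def make_model():
    """Hypothetical factory; stands in for however the model is built."""
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 4))


device = "cuda" if torch.cuda.is_available() else "cpu"
model = make_model().to(device)

# Independent CPU copies of every parameter and buffer.
cpu_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}

# Drop the last reference to the old model, then release the cache.
del model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Rebuild the structure and restore the saved weights.
model = make_model()
model.load_state_dict(cpu_state)
model.to(device)
```

This only restores the model's weights. Momentum and similar optimizer statistics live in the optimizer, so they would need optimizer.state_dict() saved and loaded in the same way; any other object that still references the old parameters will also keep their memory alive.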