How to delete a tensor on the GPU to free up memory

Correct me if I’m wrong, but here is what I do: I load an image, convert it to a torch tensor, and call cuda() on it. When I do that, torch.cuda.memory_allocated() goes from 0 to some allocated amount. But when I then delete the image using del and run torch.cuda.reset_max_memory_allocated() and torch.cuda.empty_cache(), I see no change in torch.cuda.memory_allocated(). What should I do?


That’s the right approach, which also works for me:

import torch
from PIL import Image
from torchvision import transforms

path = '...'
image = Image.open(path)

print(torch.cuda.memory_allocated())
> 0
print(torch.cuda.memory_reserved())
> 0

x = transforms.ToTensor()(image)
print(torch.cuda.memory_allocated())
> 0
print(torch.cuda.memory_reserved())
> 0

x = x.cuda()
print(torch.cuda.memory_allocated())
> 23068672
print(torch.cuda.memory_reserved())
> 23068672

del x
print(torch.cuda.memory_allocated())
> 0
print(torch.cuda.memory_reserved())
> 23068672

torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())
> 0
print(torch.cuda.memory_reserved())
> 0
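
One note on the reset_max_memory_allocated() call mentioned in the question: it only resets the peak value tracked by max_memory_allocated() and does not free any memory. A minimal sketch in a fresh process (the tensor size is arbitrary):

import torch

x = torch.randn(1024, 1024, device='cuda')  # ~4 MB
del x
print(torch.cuda.max_memory_allocated())    # still reports the ~4 MB peak
torch.cuda.reset_max_memory_allocated()     # resets only the peak counter, frees nothing
print(torch.cuda.max_memory_allocated())    # the peak restarts from the current allocation (0 here)
torch.cuda.empty_cache()                    # this is what returns the cached block to the driver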

I don’t know how, but now it works. PyTorch is full of surprises.

One question, though: why does the memory still show up in nvidia-smi? Will this affect my overall memory usage?


The first CUDA operation creates the CUDA context, which contains the native CUDA kernels, cuDNN, etc., and will use this memory until the application is closed.
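
A quick way to see the context overhead in isolation (a minimal sketch; the exact size depends on the GPU, driver, and PyTorch build):

import torch

torch.cuda.init()                      # creates the CUDA context without allocating any tensors
print(torch.cuda.memory_allocated())   # 0 -- PyTorch's caching allocator holds nothing yet
# nvidia-smi will still report several hundred MB for this process: that is the context.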


I had the same problem with del x and torch.cuda.empty_cache() not removing everything from the GPU. Eventually I wrapped the for loop in with torch.no_grad(): and now it works. I think that if gradient tracking is turned on, intermediate results are saved for the backward pass, even if you delete the final output. Therefore, turning off the gradient should solve (some people’s) problems.
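
A small illustration of how a recorded graph can keep activations alive (the model and sizes below are arbitrary; here the losses are kept in a list, which pins each graph):

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.Sigmoid(),              # sigmoid saves its output for the backward pass
    torch.nn.Linear(1024, 1),
).cuda()
x = torch.randn(256, 1024, device='cuda')

losses = []
for _ in range(10):
    losses.append(model(x).sum())    # each stored loss keeps its whole graph, including
                                     # the saved sigmoid activation, alive on the GPU
print(torch.cuda.memory_allocated())

losses.clear()                       # dropping the graphs releases the saved activations
torch.cuda.empty_cache()

with torch.no_grad():                # no graph is recorded, so nothing accumulates
    losses = [model(x).sum().item() for _ in range(10)]
print(torch.cuda.memory_allocated())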


Yes, either of these will do the job:

with torch.no_grad():

Or

x.detach().cpu()
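
For the second option, note that it only helps if you keep the CPU copy and drop the GPU reference; a minimal sketch (sizes are arbitrary):

import torch

x = torch.randn(1024, 1024, device='cuda', requires_grad=True)

x_cpu = x.detach().cpu()   # detach from the graph and copy to host; the assignment matters
del x                      # drop the last reference to the GPU storage
torch.cuda.empty_cache()   # return the cached block to the driver
print(torch.cuda.memory_allocated())   # 0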


Once the loop is done (say with no_grad on), is there a way to iterate through and delete these intermediate computations?

Hi,

I still see that the memory usage stubbornly remains unchanged with the following two different approaches:

x.detach(), del x, torch.cuda.empty_cache()

I checked the attributes of x right after x.detach().cpu(), and x still has:

is_cuda: True
grad_fn: SelectBackward
requires_grad: True

The second approach replaces x.detach().cpu() with x = x.detach().cpu(); the attributes of x right after that assignment are as follows:

is_cuda: False
grad_fn: None
requires_grad: False

Even though requires_grad and the backward-related attributes go away successfully, x still remains in GPU memory. Is there any insight into how to delete x from GPU memory?

I am hesitant to use with torch.no_grad(): because I need a few variables in the loop to keep requires_grad True for the backward operation.

Thank you in advance.

Same problem here. I try to convert an fp32 tensor to fp16 with tensor.half() and to delete the original fp32 tensor from memory. None of these approaches work.

It should work as described and verified here. Could you post an executable code snippet that shows it’s not working as intended, please?
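
For reference, a minimal sketch of the expected behavior (sizes are arbitrary):

import torch

x32 = torch.randn(1024, 1024, device='cuda')   # ~4 MB in fp32
x16 = x32.half()                                # new ~2 MB fp16 copy; the fp32 tensor still exists
del x32                                         # drop the fp32 reference
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())            # only the fp16 tensor (~2 MB) remains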

Is there any solution for this? We work on a shared server, and sometimes I need to free GPU memory for other users without killing the whole kernel. Your code indeed frees the reserved memory (torch.cuda.memory_reserved() returns 0), but nvidia-smi still shows that my kernel is occupying the memory.

PS: I use JupyterLab, which is why I sometimes still need the kernel after my model has finished training.


nvidia-smi shows the memory allocated by all processes. If you are only running PyTorch, the CUDA context would still use device memory (~1GB depending on the GPU, etc.) and cannot be released without stopping the Python kernel.

Thank you for your reply. I am afraid that nvidia-smi shows all the GPU memory occupied by my notebook. For instance, if I train a model that needs 15 GB of GPU memory and then free the space using torch (by following the procedure in your code), torch.cuda.memory_reserved() will return 0, but nvidia-smi still shows 15 GB.

nvidia-smi indeed shows all allocated memory, so if it’s still showing 15GB then some application is still using it. If you are not seeing any memory usage (either allocated or in the cache) via torch.cuda.memory_summary(), another application (or Python kernel) would be using the device memory.
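
For example, something along these lines shows what this process itself holds (a minimal sketch):

import torch

# PyTorch's view of this process' memory:
print(torch.cuda.memory_summary(device=0, abbreviated=True))
print(torch.cuda.memory_allocated(0))   # tensors currently alive
print(torch.cuda.memory_reserved(0))    # cached blocks not yet returned to the driver
# Anything nvidia-smi reports beyond this for your PID is the CUDA context or other processes.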

import torch

tensor = torch.ones((1, 3, 512, 512))
print(torch.cuda.memory_allocated())
print(torch.cuda.memory_reserved())

tensor_cu = tensor.cuda()
print(torch.cuda.memory_allocated())
print(torch.cuda.memory_reserved())

another_tensor_cu = torch.ones((1,3,512,512)).cuda()
print(torch.cuda.memory_allocated())
print(torch.cuda.memory_reserved())

del tensor_cu
print(torch.cuda.memory_allocated())
print(torch.cuda.memory_reserved())

torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())
print(torch.cuda.memory_reserved())

del another_tensor_cu
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())
print(torch.cuda.memory_reserved())

result:
0
0
3145728
20971520
6291456
20971520
3145728
20971520
3145728
20971520
0
0

Conclusion: you are correct, and I want to add that deleting a single tensor from the GPU does not affect other tensors.

Any solution here? The memory is not freed; see this minimal example:


import os
os.environ["CUDA_VISIBLE_DEVICES"] = "6"
import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
            pretrained_model_name_or_path='decapoda-research/llama-7b-hf', 
            load_in_8bit=True,
            device_map={'': 0},
        )

del model
torch.cuda.empty_cache()

print('breakpoint here - is memory freed?')

And here’s the solution:


import os
os.environ["CUDA_VISIBLE_DEVICES"] = "6"
import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
            pretrained_model_name_or_path='decapoda-research/llama-7b-hf',
            load_in_8bit=True,
            device_map={'': 0},
        )

del model
import gc
gc.collect()               # force a garbage-collection pass so objects kept alive only by reference cycles are freed
torch.cuda.empty_cache()   # then return the now-unused cached blocks to the driver

print('breakpoint here - is memory freed?')

As soon as I load the model, the memory usage goes to 793MB, which I believe is memory occupied by the CUDA context, as nothing has been passed to the model at this stage. Also, even if I delete the two loaded models (model and the feature segmenter) and clear the cache with torch.cuda.empty_cache() to check the memory status, the memory usage stays at the same 793MB.

Then at the next step, when I execute the following code:

with torch.no_grad():

        model.eval()
        avg_train_dice = []
        for img in range(len(dataset_val)):  # looping over all 3D files

            train_samples, gt_samples, voxel = dataset_val[img]  # Get the ith image, label, and voxel   
            stronger_predictions = []
            predictions = []

            for slice_id, img_slice in enumerate(train_samples): # looping over single img             
                img_slice = img_slice.unsqueeze(0)
                img_slice = img_slice.to(device)
                stronger_pred = model(img_slice)
                stronger_pred = stronger_pred.detach().cpu()
                stronger_predictions.append(stronger_pred.squeeze())
                del img_slice
                del stronger_pred
                torch.cuda.empty_cache()
 
            stronger_preds = torch.stack(stronger_predictions, dim= 0)
        
            stronger_predictions.clear()
            stronger_preds_prob = torch.sigmoid(stronger_preds)

         
            if n_channels_out == 1:
                train_dice = sdice(gt_samples.squeeze().numpy()>0,
                                    stronger_preds_prob.numpy() > 0.5,
                                    voxel[img])
            else:
                train_dice =  dice_score(torch.argmax(stronger_preds_prob, dim=1) ,torch.argmax(gt_samples, dim=1), n_outputs=n_channels_out)

            avg_train_dice.append(train_dice)

     
        avg_train_dice = np.mean(avg_train_dice)

After this step completes, the memory usage reaches 23723MB, but I was not calculating gradients in this step (torch.no_grad() is specified), and I deleted the tensors and cleared the cache too, as can be seen in the code. I am not able to understand why this is happening: there is no gradient calculation and the data has been loaded and unloaded (confirmed by the GPU utilization %), yet the memory usage still goes from the initial 793MB to 23723MB.

So the memory is almost full, and as soon as I go to my next step, the following code:

for epoch in range(1, num_epochs + 1):
        model.train()
        train_loss_total = 0.0

        num_steps = 0
        for i, batch in enumerate(train_loader):
            input_samples, gt_samples, _ = batch
            var_input = input_samples.cuda(device)
            try:
                stronger_preds = model(var_input)
            except:
                embed()

            if level == 0:
                layer_activations = model.init_path(var_input)
                preds = features_segmenter(layer_activations)
                del var_input
                # embed()
            else:  # level = 1
                layer_activations_0 = model.init_path(var_input)
                layer_activations_1 = model.down1(layer_activations_0)
                logits_ = features_segmenter(layer_activations_1)
                preds = F.interpolate(logits_, scale_factor=2, mode='bilinear')

            if n_channels_out == 1:
                stronger_preds_prob = torch.sigmoid(stronger_preds)
                loss = weighted_cross_entropy_with_logits(preds, stronger_preds_prob)
                # loss = weighted_cross_entropy_with_logits(preds, stronger_preds)
            else:

                # loss = -torch.mean(F.log_softmax(preds, dim=1)*F.softmax(stronger_preds, dim=1)) 
                loss = CE_loss(preds, torch.argmax(stronger_preds, dim=1))          

            train_loss_total += loss.item()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            num_steps += 1

        train_loss_total_avg = train_loss_total / num_steps
        num_steps = 0
        print('avg train loss', train_loss_total_avg)

The above code breaks after the 1st iteration and generates the following error, as no memory is left.
OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 23.70 GiB total capacity; 21.92 GiB already allocated; 33.00 MiB free; 22.29 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
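
(For what it’s worth, my understanding is that the max_split_size_mb setting the error message points to has to be set before the first CUDA allocation; the value below is only an illustrative guess:)

import os
# Must happen before the first CUDA allocation; 128 is only an example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
import torch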

Can you please help me with this?

That’s correct.

Could you post a minimal and executable code snippet reproducing the described observations?

Hi ptrblck,

The simple tensor deletion above seems fine to me; however, when things get a little more complex, I found that GPU memory is not released. The code follows.

import torch
import os

hidden_size = 1000

def print_cuda_mem():
    print(f"allocated: {torch.cuda.memory_allocated() / 1e6}MB, max: {torch.cuda.max_memory_allocated() / 1e6}MB, reserved: {torch.cuda.memory_reserved() / 1e6}MB")


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(hidden_size, hidden_size)

    def forward(self, x):
        return self.linear(x)


def test_mem():
    print_cuda_mem()
    rand_input = torch.rand(hidden_size, hidden_size, device='cuda')
    print_cuda_mem()
    s1 = torch.cuda.memory_snapshot()
    model = Model().cuda()
    s2 = torch.cuda.memory_snapshot()
    print_cuda_mem()
    with torch.no_grad():
        y = model(rand_input).sum().detach().cpu()
    print(y)
    s3 = torch.cuda.memory_snapshot()
    print_cuda_mem()
    del rand_input
    del model
    del y
    s4 = torch.cuda.memory_snapshot()
    torch.cuda.empty_cache()
    import gc
    gc.collect()
    s5 = torch.cuda.memory_snapshot()
    print_cuda_mem()


if __name__ == '__main__':
    test_mem()
    import gc
    gc.collect()
    s6 = torch.cuda.memory_snapshot()
    print_cuda_mem()

An example output is

allocated: 0.0MB, max: 0.0MB, reserved: 0.0MB
allocated: 4.000256MB, max: 4.000256MB, reserved: 20.97152MB
allocated: 8.004608MB, max: 8.004608MB, reserved: 23.068672MB
tensor(7986.5176)
allocated: 16.97536MB, max: 22.024192MB, reserved: 23.068672MB
allocated: 8.970752MB, max: 22.024192MB, reserved: 20.97152MB
allocated: 8.970752MB, max: 22.024192MB, reserved: 20.97152MB

Even with a manual gc.collect() and with torch.no_grad(), I found that there is still active memory usage (allocated: 8.970752MB) after s4 (in s4, s5, and s6). I also looked deeper into the snapshots and confirmed that the active memory is created between s2 and s3, i.e. during the model forward pass.
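
For reference, this is roughly how the active blocks in the snapshots can be summed up to locate the leftover allocation (a sketch; the exact snapshot layout may differ between PyTorch versions):

def active_bytes(snapshot):
    # Each segment in torch.cuda.memory_snapshot() lists its blocks;
    # count the ones still marked as allocated.
    return sum(
        block["size"]
        for segment in snapshot
        for block in segment["blocks"]
        if block["state"] == "active_allocated"
    )

# e.g. inside test_mem(), where the snapshots are in scope:
# print(active_bytes(s4) / 1e6, "MB still active after the dels")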

Any ideas? Thanks! PyTorch version: 2.1.1+cu118