How to remove variables which are converted to gpu tensors inline?

shifathosn · January 31, 2024, 5:17am

Hi there. I am trying to run a simple model by giving some input data to it. The model tensors are in cuda device. The way I am feeding the data to the model is by converting the cpu tensor to the cuda tensor by inline conversion right at the model input output = model(input.cuda()).

But after running this line I could see that the used memory of the gpu is increased (which is very normal). But after running the same line again, the newly created input gpu tensor (with the inline conversion) is again allocated to the gpu, which I guess is not normal.

In this case, is there any way to remove the previously created gpu tensor and deallocate gpu memory? Is there any way to find the variable handles which are created inline as shown in the example?

N.B. I know I could convert the cpu tensors to a gpu tensor and store in a variable to retain the variable handle (gpuTensor = cpuTensor.cuda()). And then when the work is done I could simply run del gpuTensor. That would deallocate the gpu memory. But the question is not how to remove a gpu tensor.

Thanks.

ptrblck · January 31, 2024, 6:17am

PyTorch will move the memory back to the cache once all references to the tensor are gone. How did you verify that new memory is allocated for the temp input?

shifathosn · February 1, 2024, 7:30pm

I have used nvidia-smi and nvidia-htop.py to see the allocated memory. My basic understanding is: re-running a code block with the same variable alias should not occupy more memory as the previous tensor will lose reference to a variable. This should also be true for temporary allocated memory blocks.

ptrblck · February 1, 2024, 8:10pm

That’s right and is also the case as seen here:

import torch


print(torch.cuda.memory_allocated() / 1024**2)
# 0.0
print(torch.cuda.memory_reserved() / 1024**2)
# 0.0

def fun():
    # create 5MB tensor
    x = torch.randn(5 * 1024**2 // 4, device="cuda")
    print("inside ", torch.cuda.memory_allocated() / 1024**2)
    print("inside ", torch.cuda.memory_reserved() / 1024**2)
    
for _ in range(5):
    fun()
    print("outside ", torch.cuda.memory_allocated() / 1024**2)
    print("outside ", torch.cuda.memory_reserved() / 1024**2)
    print("="*10)

# inside  5.0
# inside  20.0
# outside  0.0
# outside  20.0
# ==========
# inside  5.0
# inside  20.0
# outside  0.0
# outside  20.0
# ==========
# inside  5.0
# inside  20.0
# outside  0.0
# outside  20.0
# ==========
# inside  5.0
# inside  20.0
# outside  0.0
# outside  20.0
# ==========
# inside  5.0
# inside  20.0
# outside  0.0
# outside  20.0
# ==========

nvidia-smi will show the GPU memory usage of all processes and thus also from the CUDA context etc., so is not trivial to map to tensor data.
As you can see in my example, the temp. tensor will use memory while it’s inside the function scope (seen via torch.cuda.memory_allocated()) and its memory will be released again in the for loop. The cache is allocated to 20MB and will be reused.