Does CUDA cache memory for future usage?

I ran into something confusing about CUDA memory allocation while following the tutorial on the official website:

device = torch.device("cuda:0")
x = torch.ones(2, 2, requires_grad=True, device=device)

In this situation, the memory usage on gpu:0 is already 863 MB, even though I only created a 2x2 tensor x.

tensor([[1., 1.],
        [1., 1.]], device='cuda:0', requires_grad=True)

So I created another tensor y to watch how the memory changes:

y = x + 2

But the memory on gpu:0 is still 863 MB. Here you can see it really does track the computation history in grad_fn:

tensor([[3., 3.],
        [3., 3.]], device='cuda:0', grad_fn=<AddBackward0>)

I also tried with torch.no_grad():, which the tutorial mentions because it can prevent tracking history and using memory.

with torch.no_grad():
    z = y * y * 3
    out = z.mean()

Here you can see the grad_fn of z has disappeared, so theoretically it should save memory on gpu:0. But the memory usage is still 863 MB, so I can't tell whether it actually saves memory or not.

tensor([[27., 27.],
        [27., 27.]], device='cuda:0')

In conclusion,

1. What is the CUDA memory allocation strategy in PyTorch? Will it cache some memory for future usage? If so, how can I disable this mechanism? I don't want a small model to end up taking a lot of memory later.

2. Does with torch.no_grad(): really prevent using memory? I can't tell whether it saves memory or not in this situation.

  1. The majority of the reported memory usage comes from the creation of the CUDA context, which cannot be freed. You should see the same memory usage just by creating an empty CUDA tensor.
    The caching allocator will reserve some memory to be reused later, and you cannot disable it at the moment. Using the caching allocator avoids synchronizing memory allocations during model execution and should therefore speed up execution. To check the cached memory, you can use print(torch.cuda.memory_reserved()). This post has a small example showing the difference between allocated and reserved memory.
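To make the distinction concrete, here is a minimal sketch (the helper name report_cuda_memory is my own, not a PyTorch API) comparing torch.cuda.memory_allocated and torch.cuda.memory_reserved; it falls back to zeros when no GPU is present:

```python
import torch

def report_cuda_memory(device="cuda:0"):
    """Return (allocated_bytes, reserved_bytes) for the device.

    memory_allocated() -> bytes actively held by live tensors
    memory_reserved()  -> bytes the caching allocator has grabbed from the
                          driver, including cached blocks waiting to be reused
    Falls back to (0, 0) when CUDA is unavailable so the sketch stays runnable.
    """
    if not torch.cuda.is_available():
        return 0, 0
    return (torch.cuda.memory_allocated(device),
            torch.cuda.memory_reserved(device))

if __name__ == "__main__":
    if torch.cuda.is_available():
        x = torch.ones(1024, 1024, device="cuda:0")  # ~4 MB of float32
    print(report_cuda_memory())
```

Reserved is always at least as large as allocated, and neither counter includes the CUDA context itself, which is why nvidia-smi reports a much bigger number.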

  2. Yes, no_grad() will avoid storing the intermediate tensors that would otherwise be kept for the backward pass to calculate the gradients. The memory difference for the tiny tensors in your example might not be visible in nvidia-smi, so you could increase the tensor sizes and use torch.cuda.memory_allocated to check the difference.
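A rough sketch of that experiment (the helper name forward_allocated_bytes is my own, not a PyTorch API): under grad mode, autograd keeps the intermediate x + 2 alive for the backward pass, while under no_grad the intermediates are freed as soon as the expression finishes:

```python
import torch

def forward_allocated_bytes(track_grad: bool, n: int = 1024) -> int:
    """Run a small forward pass and return the extra bytes still allocated.

    With grad tracking on, autograd saves the intermediate (x + 2) for the
    backward of pow, so roughly one extra n x n float32 tensor stays
    allocated. Under no_grad, the intermediates are freed immediately.
    Returns 0 when CUDA is unavailable so the sketch stays runnable.
    """
    if not torch.cuda.is_available():
        return 0
    device = torch.device("cuda:0")
    x = torch.ones(n, n, requires_grad=True, device=device)
    base = torch.cuda.memory_allocated(device)
    ctx = torch.enable_grad() if track_grad else torch.no_grad()
    with ctx:
        # Same math as y = x + 2; z = y * y * 3; z.mean(), but written as one
        # expression so no local variable keeps the intermediates alive.
        out = ((x + 2) ** 2 * 3).mean()
    return torch.cuda.memory_allocated(device) - base
```

With n = 1024 I would expect grad mode to keep roughly one extra 4 MB intermediate allocated, while no_grad keeps little beyond the scalar result.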

Thanks a lot! You explained it very clearly!