That’s expected, as you are still holding references to both tensors.
This shouldn’t be the case, as `a` would be released and the memory would be added to the cache, as seen here:
import torch

# create a 4MB FP32 tensor (1024 * 1024 * 4 bytes)
a = torch.randn(1024, 1024, device='cuda')
print(torch.cuda.memory_allocated() / 1024**2)
> 4.0
print(torch.cuda.memory_reserved() / 1024**2)
> 20.0
# create an FP16 copy (2MB)
a_fp16 = a.to(torch.float16)
print(torch.cuda.memory_allocated() / 1024**2)
> 6.0
print(torch.cuda.memory_reserved() / 1024**2)
> 20.0
# delete the 4MB tensor; its memory is returned to the cache
del a
print(torch.cuda.memory_allocated() / 1024**2)
> 2.0
print(torch.cuda.memory_reserved() / 1024**2)
> 20.0
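If you then allocate another tensor that fits into the cached block, the allocator can serve it from the cache without reserving more memory, and `torch.cuda.empty_cache()` hands the cached-but-unallocated memory back to the GPU. A minimal continuation of the example above (the exact reserved values are illustrative and depend on your PyTorch version and the caching allocator's block sizes):

# allocate another 4MB tensor - served from the cache, so reserved stays flat
b = torch.randn(1024, 1024, device='cuda')
print(torch.cuda.memory_allocated() / 1024**2)
> 6.0
print(torch.cuda.memory_reserved() / 1024**2)
> 20.0
# drop all references and clear the cache
del a_fp16, b
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated() / 1024**2)
> 0.0
print(torch.cuda.memory_reserved() / 1024**2)
> 0.0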