Convert float32 to float16 with reduced GPU memory cost

That’s expected, as you are still holding references to both tensors.
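(For the sizes involved: a 1024 × 1024 float32 tensor is 4 MB and its float16 copy is 2 MB, so while both references are alive torch.cuda.memory_allocated() reports 6 MB, which matches the 6.0 printed in the snippet below.)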

This shouldn’t be the case, as a would be released and its memory returned to the caching allocator’s cache, as seen here:

import torch

# create a 4 MB float32 tensor (1024 * 1024 * 4 bytes)
a = torch.randn(1024, 1024, device='cuda')
print(torch.cuda.memory_allocated() / 1024**2)
> 4.0
print(torch.cuda.memory_reserved() / 1024**2)
> 20.0

# create an FP16 copy of the tensor (2 MB)
a_fp16 = a.to(torch.float16)
print(torch.cuda.memory_allocated() / 1024**2)
> 6.0
print(torch.cuda.memory_reserved() / 1024**2)
> 20.0

# delete the 4 MB float32 tensor; its memory goes back to the allocator's cache
del a
print(torch.cuda.memory_allocated() / 1024**2)
> 2.0
print(torch.cuda.memory_reserved() / 1024**2)
> 20.0
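
The 20 MB stays reserved because the caching allocator keeps freed blocks around for reuse rather than returning them to the driver. As a minimal sketch continuing from the snippet above (the exact reserved value afterwards depends on the allocator’s block sizes, so the comments are only indicative), torch.cuda.empty_cache() releases the cached blocks that are not currently allocated:

# release cached blocks that are not currently in use back to the driver
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated() / 1024**2)  # still 2.0, since a_fp16 is alive
print(torch.cuda.memory_reserved() / 1024**2)   # drops to roughly the block(s) backing a_fp16

If the goal is simply to keep only the FP16 copy, rebinding the same name (a = a.to(torch.float16)) drops the float32 reference as well, though peak memory during the conversion still briefly holds both tensors.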