Tensor creation uses 0.5 KB of GPU memory?

I recently stumbled upon something I don’t understand: When creating a float-tensor on GPU with only 1 element, I would assume it to take up 4 bytes of memory. However, torch.cuda.max_memory_allocated() returns 512 bytes? Why is that?

MWE to replicate:

import torch 
a = torch.tensor(1.0, device='cuda')
print(torch.cuda.max_memory_allocated())   # 512 bytes 
print(torch.cuda.max_memory_reserved())    # 2097152 bytes = 2 MB 

I’m aware that the CUDA context must be created on the GPU as well, which is why nvidia-smi shows values much higher than 0.5 KB, around 1 GB, but that’s not what I’m asking here. Why does PyTorch allocate half a KB for a 4-byte tensor, and why does it cache 2 MB when creating said tensor?

Following this answer on Stack Overflow, I ran

import sys 
print(sys.getsizeof(a))            # 64
print(sys.getsizeof(a.storage()))  # 60

which unfortunately is even more confusing. Does anybody know what’s going on here?

That’s because PyTorch manages its own buffers; you don’t want to do 4-byte allocations from the “system” memory manager.

Thanks, any resources on where I can read up on this?

Not sure; it’s a fairly common practice in high-performance C++ programming to use specialized allocators tuned for the program’s allocation patterns. With CUDA’s limited memory it is even more essential, mostly to decrease fragmentation.
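To make the two numbers from the question concrete, here is a rough pure-Python model of how a caching allocator of this kind might round request sizes. The constants and the helper names (round_size, segment_size) are my assumptions, based on my reading of PyTorch's CUDA caching allocator; they may differ between versions, so treat this as a sketch and not as the actual implementation.

```python
# Assumed constants (not taken from this thread; may vary by PyTorch version).
MIN_BLOCK_SIZE = 512                 # every block is a multiple of 512 bytes
SMALL_SIZE = 1024 * 1024             # requests up to 1 MiB count as "small"
SMALL_BUFFER = 2 * 1024 * 1024       # small requests are carved out of 2 MiB segments
MIN_LARGE_ALLOC = 10 * 1024 * 1024   # below this, large requests get a 20 MiB segment
LARGE_BUFFER = 20 * 1024 * 1024
ROUND_LARGE = 2 * 1024 * 1024        # bigger segments are rounded up to 2 MiB multiples


def round_size(nbytes):
    """Hypothetical helper: round a request up to a multiple of the block size."""
    if nbytes < MIN_BLOCK_SIZE:
        return MIN_BLOCK_SIZE
    return MIN_BLOCK_SIZE * ((nbytes + MIN_BLOCK_SIZE - 1) // MIN_BLOCK_SIZE)


def segment_size(block_bytes):
    """Hypothetical helper: size of the segment requested from the CUDA driver."""
    if block_bytes <= SMALL_SIZE:
        return SMALL_BUFFER
    if block_bytes < MIN_LARGE_ALLOC:
        return LARGE_BUFFER
    return ROUND_LARGE * ((block_bytes + ROUND_LARGE - 1) // ROUND_LARGE)


block = round_size(4)                # a single float32 needs 4 bytes
print(block, segment_size(block))    # 512 2097152
```

Under these assumptions, the 4-byte tensor occupies one 512-byte block (the "allocated" number), and that block lives inside a 2 MB segment that stays cached for later small tensors (the "reserved" number).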

But yes, this happens to be documented: see the “CUDA semantics” page of the PyTorch master documentation.
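If you want to compare the tensor's actual data size with what the allocator reports, something like the following should work. The function name cuda_memory_report is made up for illustration, and the guard simply returns None when PyTorch or a CUDA device is not available.

```python
def cuda_memory_report():
    """Hypothetical helper: byte counts for a 1-element CUDA tensor, or None without CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    a = torch.tensor(1.0, device="cuda")
    return {
        # Size of the tensor's data: bytes per element times number of elements.
        "tensor_bytes": a.element_size() * a.nelement(),
        # Bytes currently occupied by tensors, after the allocator's rounding.
        "allocated": torch.cuda.memory_allocated(),
        # Bytes held by the caching allocator, including cached free blocks.
        "reserved": torch.cuda.memory_reserved(),
    }


report = cuda_memory_report()
print(report)
```

Note that sys.getsizeof only measures the Python wrapper object on the host, not the data on the device, so it tells you nothing about GPU memory.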