How does memory allocation work precisely in PyTorch?

Hello,

I’d like to understand in depth what exactly happens when you create a new tensor. I know memory in PyTorch/libtorch is cached, so an already-allocated memory block is reused before new memory is possibly allocated. But how many memory blocks does the cache hold? What is their size? Is there any event that would cause deallocation, or does the cache grow indefinitely until empty_cache() is explicitly called?

Thank you,

It depends on the type of tensor: if it’s a CPU tensor, you can see the implementation here: pytorch/c10/core/CPUAllocator.cpp at main · pytorch/pytorch · GitHub / pytorch/c10/core/impl/alloc_cpu.cpp at main · pytorch/pytorch · GitHub .
By default, on most systems, it will be based on posix_memalign from libc, which is backed by the same arena-style allocator as malloc, but you can of course replace the allocator in libc.
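As a rough illustration (not PyTorch's actual code path), on a POSIX system you can call posix_memalign directly via ctypes and observe the alignment guarantee; the 64-byte alignment below is what PyTorch's CPU allocator requests (gAlignment in alloc_cpu.cpp):

```python
import ctypes

# Load libc symbols from the running process (works on Linux/macOS).
libc = ctypes.CDLL(None, use_errno=True)

libc.posix_memalign.argtypes = [
    ctypes.POINTER(ctypes.c_void_p),  # out: receives the allocated pointer
    ctypes.c_size_t,                  # alignment (power of two, multiple of sizeof(void*))
    ctypes.c_size_t,                  # size in bytes
]
libc.posix_memalign.restype = ctypes.c_int
libc.free.argtypes = [ctypes.c_void_p]

ptr = ctypes.c_void_p()
# Request 1 KiB aligned to 64 bytes, like PyTorch's CPU allocator does.
rc = libc.posix_memalign(ctypes.byref(ptr), 64, 1024)
assert rc == 0                # 0 means success
assert ptr.value % 64 == 0   # the returned address honors the requested alignment
libc.free(ptr)
```

The libc allocator behind this call decides whether the request is served from an existing arena or triggers a new mapping from the OS, which is why plain CPU tensors need no extra caching layer in PyTorch.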

For CUDA tensors, it’s more complicated because PyTorch runs its own caching allocator, the implementation is here: pytorch/c10/cuda/CUDACachingAllocator.cpp at main · pytorch/pytorch · GitHub
The block size and other config parameters are defined here: pytorch/c10/core/AllocatorConfig.h at main · pytorch/pytorch · GitHub

In general, you can query memory statistics from the allocator, see torch.cuda.memory.memory_stats — PyTorch 2.10 documentation . See also the general notes on CUDA memory management here: CUDA semantics — PyTorch 2.10 documentation
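For instance, a guarded sketch of querying those statistics (assuming a CUDA-enabled PyTorch install; the snippet degrades gracefully without one):

```python
try:
    import torch
    cuda_available = torch.cuda.is_available()
except ImportError:  # torch not installed
    cuda_available = False

if cuda_available:
    x = torch.empty(1024, 1024, device="cuda")  # first allocation primes the allocator
    stats = torch.cuda.memory_stats()
    # Bytes held by the caching allocator (cached + in use):
    print(stats["reserved_bytes.all.current"])
    # Bytes actually in use by live tensors:
    print(stats["allocated_bytes.all.current"])
else:
    print("CUDA not available; no allocator statistics to report")
```

The gap between reserved and allocated bytes is exactly the cached memory the original question asks about.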

I am answering from the perspective of CUDA tensors:

It depends on how much memory is available to PyTorch and what your usage pattern looks like. You can limit the available memory using torch.cuda.set_per_process_memory_fraction, for example. PyTorch does not pre-allocate memory; allocations happen when you create a tensor. The number of blocks that PyTorch holds will depend on what your allocations have looked like up to that point in the application.
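A minimal guarded example of both points, the memory cap and the lazy allocation (the 0.5 fraction is an arbitrary illustration; requires a CUDA-enabled install to do anything):

```python
try:
    import torch
    cuda_available = torch.cuda.is_available()
except ImportError:  # torch not installed
    cuda_available = False

if cuda_available:
    # Cap this process at half of device 0's total memory.
    torch.cuda.set_per_process_memory_fraction(0.5, device=0)
    before = torch.cuda.memory_reserved(0)
    # The allocation happens here, when the tensor is created, not earlier:
    x = torch.empty(256, 256, device="cuda")
    after = torch.cuda.memory_reserved(0)
    print(before, "->", after)  # reserved memory grows only after the allocation
```

Exceeding the configured fraction raises an out-of-memory error just as exhausting the physical device would.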

The CUDACachingAllocator holds two types of blocks: small blocks and large blocks. Small blocks are carved out of 2MB segments and large blocks out of 20MB segments, but that can be configured using large_segment_size_mb.
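As a toy sketch of that routing (constants as they appear in CUDACachingAllocator.cpp in recent PyTorch versions; this is a simplification, not the actual implementation): requests are rounded up to 512-byte granularity, requests up to 1MB are served from 2MB "small" segments, moderately large ones from 20MB "large" segments, and very large ones get a cudaMalloc rounded up to a 2MB multiple:

```python
MB = 1024 * 1024
K_MIN_BLOCK = 512        # all requests rounded to multiples of 512 B (kMinBlockSize)
K_SMALL_SIZE = 1 * MB    # requests <= 1 MB go to the small pool (kSmallSize)
K_SMALL_BUFFER = 2 * MB  # small pool segments are 2 MB (kSmallBuffer)
K_LARGE_BUFFER = 20 * MB # large pool segments are 20 MB (kLargeBuffer)

def round_size(size: int) -> int:
    """Round a request up to the allocator's 512-byte granularity."""
    return ((size + K_MIN_BLOCK - 1) // K_MIN_BLOCK) * K_MIN_BLOCK

def segment_size(size: int) -> int:
    """Size of the segment obtained from cudaMalloc for a (rounded) request."""
    if size <= K_SMALL_SIZE:
        return K_SMALL_BUFFER
    if size < 10 * MB:
        return K_LARGE_BUFFER
    # Very large requests: cudaMalloc a 2 MB-rounded segment of their own.
    return ((size + K_SMALL_BUFFER - 1) // K_SMALL_BUFFER) * K_SMALL_BUFFER

# Even a tiny 4-byte request consumes a full 512 B block...
print(round_size(4))                 # 512
# ...carved out of a 2 MB small-pool segment:
print(segment_size(round_size(4)))   # 2097152 (2 MB)
# A 5 MB request is served from a 20 MB large-pool segment:
print(segment_size(5 * MB))          # 20971520 (20 MB)
```

This is why the cache's footprint depends on your allocation history: freed blocks stay in their pool for reuse, and segments are only returned to the driver under memory pressure or an explicit empty_cache().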