GPU RAM fragmentation diagnostics

I think Pascal GPUs operate with 2 MB pages. (At least this appears to be the case on a P100; none of this is well documented).

I find that cudaMalloc successfully allocates non-contiguous memory.

Yes – to a certain extent. The driver can re-map pages, but not sub-pages. So you can allocate all your memory in chunks of the page size, free every other allocation (so you have lots of “holes”), and then allocate all the remaining space in one contiguous chunk. However, if your holes are smaller than the page size it won’t help. (For example, try allocating all memory in chunks of 2097153 bytes using the CUDA API. How much memory can you allocate? There’s a sketch of this experiment below.)
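If you want to actually run that experiment, here’s a minimal sketch using the CUDA runtime API (my own illustration, not code from this thread). It keeps calling cudaMalloc with 2 MiB + 1 byte requests until the driver refuses; if allocations really are rounded up to whole 2 MiB pages, each chunk reserves two pages and you should only be able to hand out roughly half of the card this way:

```cpp
// Illustrative experiment (not from the thread): how much of the GPU can you hand
// out in chunks of 2 MiB + 1 byte? If each allocation is rounded up to whole 2 MiB
// pages, every chunk reserves two pages and roughly half the capacity is wasted on
// the unusable tail of the second page.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const size_t chunk = 2 * 1024 * 1024 + 1;  // 2097153 bytes: one byte past a 2 MiB page
    std::vector<void*> ptrs;

    // Keep allocating until the driver refuses.
    void* p = nullptr;
    while (cudaMalloc(&p, chunk) == cudaSuccess) {
        ptrs.push_back(p);
    }
    cudaGetLastError();  // clear the out-of-memory error left by the failed cudaMalloc

    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::printf("allocated %zu chunks (%zu bytes requested); device reports %zu of %zu bytes free\n",
                ptrs.size(), ptrs.size() * chunk, free_bytes, total_bytes);

    for (void* q : ptrs) cudaFree(q);
    return 0;
}
```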

PyTorch takes advantage of this behavior by freeing all unused cached allocations when an allocation fails. This allows the driver to remap pages to make larger contiguous chunks, and then we retry the allocation.
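Roughly, that retry path looks like the following simplified sketch (hypothetical names; this is not the actual allocator code – see the EDIT below for the real source):

```cpp
// Simplified sketch of the "free the cache and retry" behaviour (hypothetical names;
// this is not the actual PyTorch allocator code).
#include <cuda_runtime.h>
#include <vector>

std::vector<void*> cached_blocks;  // unused blocks the allocator is holding on to

void free_cached_blocks() {
    // Handing the blocks back lets the driver remap their pages into larger
    // contiguous regions of the virtual address space.
    for (void* p : cached_blocks) cudaFree(p);
    cached_blocks.clear();
}

void* malloc_with_retry(size_t size) {
    void* ptr = nullptr;
    if (cudaMalloc(&ptr, size) == cudaSuccess) return ptr;

    cudaGetLastError();    // clear the out-of-memory error
    free_cached_blocks();  // return everything unused to the driver...
    if (cudaMalloc(&ptr, size) == cudaSuccess) return ptr;  // ...and retry once

    return nullptr;  // genuinely out of memory
}
```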

However, PyTorch isn’t able to release portions of blocks that are partially used (because there’s no CUDA API to do so). The CUDA driver is more powerful than PyTorch (since it can remap pages). If our goal were only to reduce fragmentation we would just use cudaMalloc and cudaFree directly. But cudaFree synchronizes the host with the GPU, which can hurt performance.

NVIDIA is much better positioned than PyTorch to write a good caching allocator as part of the CUDA API, but they don’t seem inclined to do so. (I’ve asked).

Note that there are two levels of potential fragmentation: fragmentation due to PyTorch and fragmentation due to the CUDA driver.

To answer some of your previous questions:

  • torch.ones((d, d)).cuda() will always allocate a contiguous block of GPU RAM (in the virtual address space)
  • Your allocation x3 = mem_get(1024) likely succeeds because PyTorch cudaFree’s x1 on failure and retries the allocation. (And as you saw, the CUDA driver can re-map pages).
  • PyTorch uses “best-fit” among cached blocks (i.e. the smallest cached block that is large enough). If there’s no suitable block, it will try to cudaMalloc a new one. (A rough sketch of this follows the list.)
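As a rough illustration of that last point, here is what “best fit among cached blocks, else cudaMalloc” could look like (hypothetical names and a deliberately simplified free list; the real logic, including block splitting and size rounding, is in THCCachingAllocator.cpp):

```cpp
// Rough sketch of "best fit among cached blocks, else cudaMalloc" (hypothetical names;
// not the actual THCCachingAllocator code).
#include <cuda_runtime.h>
#include <map>

std::multimap<size_t, void*> free_blocks;  // cached blocks, keyed by size

void* allocate(size_t size) {
    // Best fit: the smallest cached block whose size is >= the request.
    auto it = free_blocks.lower_bound(size);
    if (it != free_blocks.end()) {
        void* ptr = it->second;
        free_blocks.erase(it);
        return ptr;  // may be larger than requested; a real allocator would split it
    }
    // No suitable cached block: ask the driver for a new one
    // (and, as described above, free the cache and retry if this fails).
    void* ptr = nullptr;
    if (cudaMalloc(&ptr, size) == cudaSuccess) return ptr;
    return nullptr;
}
```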

EDIT: See THCCachingAllocator.cpp for source code
