How do I rewrite the GPU memory allocation algorithm of PyTorch?

Hi, from what I have read in the documentation so far, it seems that the only way to provide a custom CUDA memory allocator is through the CUDAPluggableAllocator class. Is that correct?

What I want to achieve is that given a simple linear model:

in-> A->B->C->D->E-> out

I want to be able to control where the GPU memory of these 5 nodes (A~E) is allocated/stored. (In fact, it would be great if I could also control the allocation of the weights between these nodes.) This is related to the gradient-checkpointing technique. Let’s say C is the only checkpoint, so the model is divided into two parts (excluding the checkpoint itself):

[A,B] C [D,E]

It would be great if I could decide where to put the GPU memory of the three chunks, C, [A,B], and [D,E], instead of letting PyTorch manage it by default. I might be heading in the wrong direction by trying to rewrite the CUDA memory allocator to achieve this, so please help! If you know of any good PyTorch tutorials or documentation on this, I would be grateful if you could share them.

Thanks for reading!

I don’t think a custom CUDA allocator would help here as your use case sounds more like CPU-offloading. This post might be helpful.

I’m new to the term CPU-offloading. Could you elaborate on why it’s related to my case? You might have understood my use case already, but forgive me for clarifying it again to avoid confusion:

  1. My goal is to reduce the GPU memory usage shown by the nvidia-smi command, i.e. I don’t want it to show a higher value than what my model actually uses. (It’s acceptable if the value is just a little higher, e.g. by <1GB.)
  2. In my current understanding, the value is higher because PyTorch probably allocates GPU memory aggressively. It would be great if I could make PyTorch follow the instructions of my algorithm so that I can avoid the problem described in point 1.

Thanks again for reading!

PyTorch’s memory allocator isn’t aggressively allocating memory, but uses an internal caching mechanism to be able to reuse memory instead of re-allocating it with synchronizing and thus expensive cudaMalloc calls. If you don’t want to use this cache, you can free it for a performance penalty via torch.cuda.empty_cache().
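To see the cache in action, here is a minimal sketch that reads the allocator counters around a free and a torch.cuda.empty_cache() call. The helper name cache_report and the 64 MiB buffer size are made up for illustration, and the function simply returns None when PyTorch or a CUDA device is not available.

```python
import importlib.util


def cache_report(n_bytes=64 * 1024 * 1024):
    """Return allocator counters around a free + empty_cache(), or None without CUDA.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    if importlib.util.find_spec("torch") is None:
        return None  # PyTorch is not installed in this environment
    import torch

    if not torch.cuda.is_available():
        return None  # no GPU to measure

    x = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
    report = {
        "allocated_live": torch.cuda.memory_allocated(),
        "reserved_live": torch.cuda.memory_reserved(),
    }
    del x  # the block goes back to PyTorch's cache, not to the driver
    report["allocated_freed"] = torch.cuda.memory_allocated()
    report["reserved_freed"] = torch.cuda.memory_reserved()
    torch.cuda.empty_cache()  # hand unused cached blocks back to the driver
    report["reserved_emptied"] = torch.cuda.memory_reserved()
    return report


if __name__ == "__main__":
    print(cache_report())
```

On a GPU machine, "allocated" should drop right after the del, while "reserved" should only drop after empty_cache(). The reserved value is the one that is closer to what nvidia-smi reports.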

Will the internal caching mechanism cache more memory than the model actually uses? I didn’t mean I want to avoid the caching mechanism, but I want to make sure that the internal caching mechanism is done efficiently.

No, it will only allocate needed memory pages and increase the overall usage if needed.
A slight overhead might be visible as the page sizes are pre-defined.

But this is not what I observed. I guess we might have different interpretations of the word “allocated”, and I probably used the wrong wording. When I said “allocated memory”, I was trying to describe the value shown by the nvidia-smi command.

I have inserted many calls to torch.cuda.memory_allocated in the hooks of my model (e.g. register_forward_hook, register_full_backward_hook), and the highest point shown in my GPU memory graph is ~7.5GB. But nvidia-smi always showed a usage of ~15GB.

Am I correct to say that PyTorch will reserve (cache) GPU memory aggressively for later use?

PyTorch will allocate the memory it needs and move it to its cache when it’s not needed anymore instead of freeing it (it will never cudaFree its own memory unless it runs into an OOM and retries the allocation). I don’t know if you consider this as “PyTorch will reserve (cache) aggressively”.
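The behaviour described above can be mimicked with a toy model. This is a deliberately simplified sketch in plain Python, not PyTorch’s real allocator (which is far more elaborate); the class ToyCachingAllocator and its methods are invented for illustration. It only shows why the "reserved" number stays at the peak while the "allocated" number goes down.

```python
class ToyCachingAllocator:
    """Toy model of a caching allocator (NOT PyTorch's real implementation)."""

    def __init__(self):
        self.allocated = 0      # bytes held by live "tensors"
        self.reserved = 0       # bytes obtained from the "driver" and not returned
        self._free_blocks = []  # sizes of cached blocks available for reuse

    def malloc(self, size):
        # Reuse the smallest cached block that fits, if there is one.
        fitting = sorted(b for b in self._free_blocks if b >= size)
        if fitting:
            block = fitting[0]
            self._free_blocks.remove(block)
        else:
            block = size  # stands in for an expensive cudaMalloc call
            self.reserved += block
        self.allocated += block
        return block

    def free(self, block):
        # A freed block goes into the cache; reserved does not shrink.
        self.allocated -= block
        self._free_blocks.append(block)

    def empty_cache(self):
        # Return all unused cached blocks to the "driver".
        self.reserved -= sum(self._free_blocks)
        self._free_blocks.clear()


alloc = ToyCachingAllocator()
acts = [alloc.malloc(1_000) for _ in range(7)]  # peak, e.g. activations kept for backward
for a in acts[:5]:
    alloc.free(a)                               # most of them are released again
print(alloc.allocated, alloc.reserved)          # 2000 7000
```

In this toy run the live usage is 2000 bytes, but 7000 bytes remain reserved because the peak was 7000. A tool that looks at the process from the outside would see the reserved figure.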

Say I can predict the GPU memory usage of my model by adding up the model weights, the intermediate data/activations in the forward pass, and the gradients in the backward pass (according to your reply in another thread, I believe they’re allocated when needed), and I get an expected size of, say, 8GiB. Then I would consider PyTorch to reserve (cache) more than necessary if the value given by nvidia-smi is 15GiB. (I think we should not focus on the wording I used. English is not my mother tongue, and I apologize if you don’t like the word “aggressively”.)
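The kind of estimate described here (weights plus gradients plus saved activations) could be sketched as follows. The helper estimate_training_bytes and the sizes passed to it are made up for illustration, and the result is only a rough lower bound: it ignores optimizer state, temporary workspaces, allocator rounding, and the CUDA context.

```python
def estimate_training_bytes(num_params, num_activation_elems, bytes_per_elem=4):
    """Rough lower bound: weights + gradients + saved activations (hypothetical helper)."""
    weights = num_params * bytes_per_elem
    gradients = num_params * bytes_per_elem               # one gradient per parameter
    activations = num_activation_elems * bytes_per_elem   # kept for the backward pass
    return weights + gradients + activations


GIB = 1024 ** 3
# Made-up sizes for illustration only: 0.5e9 parameters, 1e9 activation elements, fp32.
estimate = estimate_training_bytes(500_000_000, 1_000_000_000)
print(f"{estimate / GIB:.2f} GiB")  # 7.45 GiB
```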

I only say “PyTorch might reserve more GPU memory than necessary” as a guess, because I haven’t investigated the source code to understand how it is implemented. For me, a good thing about PyTorch is that it provides a forum, so I thought I should ask some experts first before reading the source. That was my intention in asking.

What I want to understand is why nvidia-smi shows much more GPU memory than the value given by torch.cuda.memory_allocated.


nvidia-smi shows the memory held by all applications. Assuming only your PyTorch script is running, its peak memory usage was higher at some point, e.g. when the intermediate forward activations were stored for the gradient computation.
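One way to check for such a peak is to compare the current counter with the peak and reserved counters that PyTorch keeps. Below is a minimal sketch; the helper name memory_snapshot is made up, and it returns None when PyTorch or a CUDA device is not available. As far as I know, nvidia-smi reports roughly the reserved memory plus the CUDA context overhead, so memory_reserved() is the closer comparison than memory_allocated().

```python
import importlib.util


def memory_snapshot():
    """Return current, peak, and reserved byte counts, or None without CUDA.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    if importlib.util.find_spec("torch") is None:
        return None  # PyTorch is not installed in this environment
    import torch

    if not torch.cuda.is_available():
        return None  # no GPU to measure

    return {
        "allocated": torch.cuda.memory_allocated(),           # live tensors right now
        "peak_allocated": torch.cuda.max_memory_allocated(),  # highest live usage so far
        "reserved": torch.cuda.memory_reserved(),             # live + cached blocks
        "peak_reserved": torch.cuda.max_memory_reserved(),
    }


if __name__ == "__main__":
    print(memory_snapshot())
```

Calling this after a full forward/backward iteration (rather than only inside hooks) should reveal whether the peak was higher than the values sampled in the hooks.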