I'm trying to rewrite the CUDA cache memory allocator

Hi, I’m trying to temporarily disable the pre-allocation of CUDA GPU memory so I can investigate the GPU memory usage of my model in some experiments. To be clear:

  1. I want to make sure that GPU memory is allocated as late as possible, so I can read the correct and exact memory usage of each layer of my model. (I understand that cudaMalloc is an expensive operation, but that’s fine in my case!)
  2. I want to release the GPU memory as soon as possible. From my current understanding, this is already handled decently in the current code base; I just mention it here for completeness.

From my current understanding, I will need to investigate cudaMalloc, and the pre-allocation is controlled by the cache memory allocator mentioned in my title.

Am I on the right track in trying to understand the source of CUDAMallocAsyncAllocator? It would help me a lot if someone has done this before and could share their experience. Or you could point out what’s wrong with this approach and recommend another one. Thanks!

You can disable the caching allocator via export PYTORCH_NO_CUDA_MEMORY_CACHING=1, which will then use plain cudaMalloc/cudaFree calls without reusing any memory.
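For example (just a minimal sketch; I’m assuming the variable has to be set before CUDA is initialized, so setting it before importing torch is the safest option):

```python
# Minimal sketch: disable the caching allocator for a profiling run.
# Assumption: the env var must be visible before CUDA is initialized,
# so we set it before importing torch.
import os
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"

import torch

x = torch.randn(1024, 1024, device="cuda")  # backed by a direct cudaMalloc
print(torch.cuda.memory_allocated())        # bytes currently allocated
del x                                       # memory is returned via cudaFree, not cached
```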

Hi, ptrblck, thanks for your advice! I will try it for sure!

Before I found your answer, I was thinking about using both:

  • torch.cuda.max_memory_reserved
  • torch.cuda.reset_peak_memory_stats

in register_forward_hook calls on the nested nn.Modules to measure the CUDA GPU memory of each layer of my model as if PyTorch’s CUDA caching mechanism were not enabled. (In detail: I was thinking about subtracting the value returned by the first API from the GPU memory usage that includes the caches.) Do you think this is a good idea? Thanks for your kindness :slight_smile: (and I probably used the wrong tag, let me fix it now!)
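For reference, here is a rough, untested sketch of what I had in mind (attach_memory_hooks is just a name I made up for this example):

```python
# Rough sketch: reset the peak stats right before each module runs and
# read them back afterwards via forward pre-/post-hooks.
import functools
import torch
import torch.nn as nn

def attach_memory_hooks(model: nn.Module):
    peaks = {}

    def pre_hook(name, module, inputs):
        # clear the running peak so the next reading only covers this module
        torch.cuda.reset_peak_memory_stats()

    def post_hook(name, module, inputs, output):
        # peak reserved memory (allocations + cache) observed during this forward
        peaks[name] = torch.cuda.max_memory_reserved()

    # note: for nested modules, a child's reset also clears the parent's
    # running peak, so parent numbers only cover what runs after the last child
    for name, module in model.named_modules():
        module.register_forward_pre_hook(functools.partial(pre_hook, name))
        module.register_forward_hook(functools.partial(post_hook, name))
    return peaks
```

Then I would run a forward pass and print the collected dictionary afterwards.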

Instead of checking the reserved memory (which includes the cache) and subtracting the memory reported by nvidia-smi, I would just use torch.cuda.memory_allocated, which returns the currently used and allocated memory only.
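For example, the same hook idea could record the difference in memory_allocated around each module (again just a sketch, not tested; the helper name is made up):

```python
# Sketch: record how much *allocated* (not cached) memory each module adds.
import functools
import torch
import torch.nn as nn

def attach_allocated_hooks(model: nn.Module):
    baseline, deltas = {}, {}

    def pre_hook(name, module, inputs):
        baseline[name] = torch.cuda.memory_allocated()

    def post_hook(name, module, inputs, output):
        # bytes still held after the forward, e.g. by the module's output activations
        deltas[name] = torch.cuda.memory_allocated() - baseline[name]

    for name, module in model.named_modules():
        module.register_forward_pre_hook(functools.partial(pre_hook, name))
        module.register_forward_hook(functools.partial(post_hook, name))
    return deltas
```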

I agree, memory_allocated is the better idea. I think raining_day513 is on the right track.
