Hi, I’m trying to temporarily disable the pre-allocation of CUDA GPU memory so I can investigate the GPU memory usage of some experiments with my model. To be clear:
- I want to make sure that GPU memory is allocated as late as possible, so I can read the correct and exact memory usage of each layer of my model. (I understand that cudaMalloc is an expensive operation, but that’s fine in my case!)
- I want the GPU memory to be released as soon as possible. From my current understanding, the current code base already does this reasonably well; I just mention it here for completeness.
From my current understanding, I will need to investigate cudaMalloc, and the pre-allocation is controlled by the caching memory allocator mentioned in my title. Am I on the right track in trying to understand the source of CUDAMallocAsyncAllocator? It would help me a lot if someone has done this before and could share their experience. Or you could point out what’s wrong with this approach and recommend another one. Thanks!
You can disable the caching allocator via export PYTORCH_NO_CUDA_MEMORY_CACHING=1, which will then use plain cudaMalloc/cudaFree calls without reusing any memory.
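A minimal sketch of how this would be used; the script name here is just a placeholder, not anything from the thread:

```shell
# Disable PyTorch's CUDA caching allocator for this shell session.
# Every tensor allocation then maps to a raw cudaMalloc and every free
# to a cudaFree (slow, but gives exact per-op memory readings).
export PYTORCH_NO_CUDA_MEMORY_CACHING=1
# python my_experiment.py   # hypothetical script; launch your run here
```

Note that the variable must be set before the process that imports PyTorch starts, since the allocator is chosen at initialization.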
Hi ptrblck, thanks for your advice! I will try it for sure!
Before I found your answer, I was thinking about using register_forward_hook on the nested nn.Modules together with the memory stats to measure the CUDA GPU memory of each layer of my model, as if PyTorch's CUDA caching mechanism were not enabled. (In detail: I was thinking of subtracting the value returned by the first API from the total GPU memory usage, which includes the caches.) Do you think this is a good idea? Thanks for your kindness! (And I probably used the wrong tag; let me fix it now.)
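For reference, a minimal sketch of the forward-hook idea (the helper name attach_memory_hooks and the toy model are my own, not from the thread); it records torch.cuda.memory_allocated() after each leaf layer runs, and simply reports 0 on a CPU-only machine:

```python
import torch
import torch.nn as nn

def attach_memory_hooks(model):
    """Register a forward hook on every leaf module that records the
    CUDA memory allocated right after that layer's forward pass."""
    stats = {}

    def make_hook(name):
        def hook(module, inputs, output):
            if torch.cuda.is_available():
                torch.cuda.synchronize()  # make sure the kernels finished
                stats[name] = torch.cuda.memory_allocated()
            else:
                stats[name] = 0  # no CUDA device: nothing to measure
        return hook

    handles = []
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            handles.append(module.register_forward_hook(make_hook(name)))
    return stats, handles

# Toy model just to exercise the hooks.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
stats, handles = attach_memory_hooks(model)
model(torch.randn(2, 8))
for h in handles:
    h.remove()  # clean up the hooks when done
```

After the forward pass, stats maps each leaf layer's name to the bytes allocated at that point in the forward pass.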
Instead of checking the reserved memory (which includes the cache) and subtracting the memory reported by nvidia-smi, I would just use torch.cuda.memory_allocated, which returns only the used (allocated) memory.
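To illustrate the difference between the two stats, a small sketch (the helper name cuda_mem_snapshot is mine, and the code falls back to zeros on a machine without a CUDA device):

```python
import torch

def cuda_mem_snapshot():
    """Return (allocated, reserved) in bytes; (0, 0) when no CUDA device."""
    if not torch.cuda.is_available():
        return (0, 0)
    # memory_allocated counts only live tensors; memory_reserved also
    # includes blocks cached by the allocator for later reuse.
    return (torch.cuda.memory_allocated(), torch.cuda.memory_reserved())

allocated, reserved = cuda_mem_snapshot()
# The cache always holds at least what is currently allocated.
assert allocated <= reserved
```

With the caching allocator active, reserved typically stays high after tensors are freed, while allocated drops immediately, which is why allocated is the number to read per layer.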
I agree, memory_allocated is the better idea. I think raining_day513 is on the right track.