Controlling the GPU memory cache

I am using a U-Net modified into a 3D convolution version. The problem is that PyTorch uses a very large amount of GPU memory for my 3D U-Net. Because of this, I initially had to limit the training batch size to 2.

torch.cuda.memory_allocated() reports low memory usage (around 5 GB), but torch.cuda.max_memory_allocated() reports high memory usage (around 36 GB).

At first, I thought the cause was the excessively large 3D feature maps stored for U-Net's skip connections, but it wasn't. I found that just performing a 3D convolution on a 128 x 128 x 128 volume makes PyTorch allocate a huge amount of memory that is not used by subsequent operations.
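A minimal sketch of the kind of repro I mean (the channel counts and batch size here are just placeholders, not my real network):

```python
import torch
import torch.nn as nn

device = torch.device("cuda")

# a single 3D convolution on a 128 x 128 x 128 volume
x = torch.randn(2, 1, 128, 128, 128, device=device, requires_grad=True)
conv = nn.Conv3d(1, 32, kernel_size=3, padding=1).to(device)
y = conv(x)

print(torch.cuda.memory_allocated(device) / 1024**3, "GB currently allocated")
print(torch.cuda.max_memory_allocated(device) / 1024**3, "GB peak allocated")
print(torch.cuda.memory_reserved(device) / 1024**3, "GB reserved by the allocator")
```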

To build the network deeper and increase the batch size, I worked around the problem by calling torch.cuda.reset_max_memory_allocated(device) and torch.cuda.empty_cache() in the middle of the forward() function, but I don't think this is a nice solution: torch.cuda.empty_cache() slows down the network.
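Roughly what the workaround looks like (a toy sketch with placeholder layers, not my real architecture):

```python
import torch
import torch.nn as nn

class TinyUNet3D(nn.Module):
    """Toy stand-in for my 3D U-Net."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv3d(1, 8, kernel_size=3, padding=1)
        self.dec = nn.Conv3d(8, 1, kernel_size=3, padding=1)

    def forward(self, x):
        x = self.enc(x)
        # workaround: reset the peak counter and hand cached blocks
        # back to the driver in the middle of forward()
        torch.cuda.reset_max_memory_allocated(x.device)
        torch.cuda.empty_cache()
        return self.dec(x)
```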

  1. Why does this cache problem occur?
  2. Is there any better solution?

What you are probably seeing is how PyTorch manages memory with its CUDA caching allocator. Even if the program does not subsequently use a region of memory, memory already allocated by PyTorch is not returned to the device, in order to avoid excessive cudaMalloc/cudaFree calls on every iteration.

Calling empty_cache() can help alleviate memory fragmentation, which can also be an issue sometimes, but yes, it does slow down your iterations.
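A small sketch of the distinction between memory that is live and memory the caching allocator is merely holding on to (the tensor size here is arbitrary, just big enough to see the effect):

```python
import torch

device = torch.device("cuda")

x = torch.randn(512, 1024, 1024, device=device)  # ~2 GB tensor
del x                                            # no longer referenced

# The tensor is gone, but the caching allocator keeps the block for reuse:
print(torch.cuda.memory_allocated(device))  # ~0 bytes live
print(torch.cuda.memory_reserved(device))   # still ~2 GB cached by PyTorch

torch.cuda.empty_cache()                    # return cached blocks to the driver
print(torch.cuda.memory_reserved(device))   # now close to 0; future allocations
                                            # will need fresh cudaMalloc calls
```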

Autograd needs to save tensors during the forward pass in order to compute the backward pass (though I'm not sure what you mean by "not used by subsequent processes"). If you want to scale your model but are bottlenecked by memory, another option is activation checkpointing, which lets you save fewer activations during the forward pass in exchange for recomputing them during the backward pass.
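A minimal sketch of what checkpointing a stage could look like (the block here is a generic placeholder, not your U-Net's actual layers):

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStage(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(8, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv3d(8, 8, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # Activations inside self.block are not stored for backward;
        # they are recomputed during the backward pass instead.
        # (On newer PyTorch versions you may also want to pass use_reentrant=False.)
        return checkpoint(self.block, x)
```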

Further reading:

https://pytorch.org/docs/stable/checkpoint.html
