Huge gap between CUDA allocated and reserved memory

Hi,

I am using PyTorch for a small project of mine and I noticed a surprisingly large gap between CUDA allocated memory and reserved memory. While training my model, reserved memory hits 12 GB while allocated memory never exceeds 1.5 GB. Even after calling torch.cuda.empty_cache(), the gap between the two remains. I checked, and no other process is running on the GPU I'm using.
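
For reference, here is roughly how I read the two counters (a minimal sketch, not my actual training loop; the tensor size is arbitrary):

```python
import torch

device = torch.device("cuda:0")

# Allocate something so the caching allocator reserves a block.
x = torch.randn(1024, 1024, device=device)

print(f"allocated: {torch.cuda.memory_allocated(device) / 1024**2:.1f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved(device) / 1024**2:.1f} MiB")

del x
torch.cuda.empty_cache()  # release unused cached blocks back to the driver

print(f"reserved after empty_cache: {torch.cuda.memory_reserved(device) / 1024**2:.1f} MiB")
```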

As far as I understand, based on this topic, the reserved memory includes pre-cached memory, so it should be greater than the allocated memory, but I don't understand why I observe such a large difference.

When I get OOM errors, PyTorch suggests setting the max_split_size_mb option of the PYTORCH_CUDA_ALLOC_CONF environment variable, but I have no idea what value could help resolve my issue.
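
For reference, this is how I have been experimenting with it (128 is just an arbitrary test value, not a recommendation). The variable has to be set before the first CUDA call, either in the shell or at the top of the script:

```python
import os

# Must be set before CUDA is initialized (i.e. before the first CUDA call).
# Equivalent shell form: PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python train.py
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the variable

x = torch.randn(8, device="cuda")  # first CUDA call picks up the setting
```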

So my questions are:

  • Why are the reserved and allocated memory values so different?
  • How can I lower the reserved memory (relative to the allocated memory) to avoid OOM errors?

Thanks for your time

Reserved memory includes both the allocated and the cached memory, so I would expect the two to be roughly equal after calling torch.cuda.empty_cache(). But I too have experienced otherwise, so I'm also looking for an answer. (I already asked in this comment.)
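
For what it's worth, here is the kind of minimal check one can run (a toy sketch, not a real training loop; the tensor sizes are arbitrary). torch.cuda.memory_summary() is also useful to see how the cached blocks are split up:

```python
import torch

device = torch.device("cuda:0")

# Allocate a batch of tensors, then free them all.
tensors = [torch.randn(4096, 4096, device=device) for _ in range(8)]
del tensors

print("before empty_cache, reserved:", torch.cuda.memory_reserved(device))
torch.cuda.empty_cache()
print("after empty_cache,  reserved:", torch.cuda.memory_reserved(device))
print("allocated:", torch.cuda.memory_allocated(device))

# Detailed breakdown of the caching allocator's state (block sizes, fragmentation).
print(torch.cuda.memory_summary(device))
```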