How to reduce maximum cache size on GPU?

I am running models from the HuggingFace transformers library and encountering issues with large GPU cache size. When running on a 3090, my max memory usage (torch.cuda.max_memory_allocated()) is 7GB, while the max reserved memory (torch.cuda.max_memory_reserved()) is 12GB. This seems alright. However, when I run on an A6000, my max memory usage is still 7GB, but my reserved memory rises to 43GB.
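For reference, this is roughly how I am collecting the numbers above (the helper name is just for illustration):

```python
import torch

def report_gpu_memory(device: int = 0) -> None:
    # Peak memory actually occupied by tensors vs. peak memory reserved
    # (cached) by PyTorch's CUDA caching allocator.
    allocated_gib = torch.cuda.max_memory_allocated(device) / 1024**3
    reserved_gib = torch.cuda.max_memory_reserved(device) / 1024**3
    print(f"max allocated: {allocated_gib:.2f} GiB, max reserved: {reserved_gib:.2f} GiB")
```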

I am wondering how the PyTorch caching allocator works, i.e., how PyTorch determines the maximum amount of memory to reserve, and whether this can be reduced. My motivation is that other folks may be running jobs on the machine I am using, so I do not want to hog the GPU memory. I saw there are some configuration variables for the cache / allocator on this page: CUDA semantics — PyTorch 2.2 documentation, but they do not seem to accomplish what I am seeking.
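For context, this is the kind of allocator tuning I tried; the sketch below is illustrative only (the value for max_split_size_mb is arbitrary, and PYTORCH_CUDA_ALLOC_CONF has to be set before the first CUDA allocation):

```python
import os

# Limit how large the cached blocks are that the allocator is allowed to split;
# this can reduce fragmentation but did not cap the overall reserved size for me.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch

# Manually return currently unused cached blocks to the driver.
# This only releases cache that is not in use; it does not limit future growth.
torch.cuda.empty_cache()
```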

What stands out to me about this issue is that the cache on the 3090 only goes to 12GB, so clearly it is possible to have a smaller cache on the A6000 as well.

Regarding code, this is part of a large project, so it’s difficult to make a minimal example, but I can try to make one if needed.

You can use torch.cuda.set_per_process_memory_fraction(fraction, device=None) to set the maximum memory the process is allowed to allocate on a device. The root cause of the increased cache usage is hard to diagnose without the model or a minimal reproducible example.
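A minimal sketch of how this could look (the 0.3 fraction is just an example value):

```python
import torch

device = torch.device("cuda:0")

# Cap this process at ~30% of the device's total memory.
# Allocations beyond the cap raise an out-of-memory error
# instead of letting the cache keep growing.
torch.cuda.set_per_process_memory_fraction(0.3, device=device)

# Optionally release cached blocks that are not currently in use.
torch.cuda.empty_cache()
```

Note that this sets a hard limit rather than shrinking the cache gracefully, so jobs that genuinely need more memory will fail with an OOM error once the cap is hit.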