[Documented fix] Slow execution, Pytorch using GPU shared memory

Hi everyone!

I recently encountered slowdowns while training using my consumer-grade GPU (Nvidia GTX), and I wanted to document the root cause.

My GPU has 4GB of dedicated memory and 16GB of shared memory. When training, I would encounter an issue where training was fast at first and then would slow down massively. This wasn’t super obvious: it just felt like training was slow, and I didn’t know why.

The issue turned out to be that PyTorch doesn’t differentiate between dedicated GPU memory and shared GPU memory, but accessing shared GPU memory is, of course, much slower than accessing dedicated GPU memory. In a training loop, tensors first get allocated in dedicated memory, then in shared memory once the dedicated memory is full. When both dedicated and shared memory are full, PyTorch flushes its memory cache and the cycle starts over again.

My testing showed that when Pytorch is allocating tensors in shared memory, my training loop takes about 3x the time it takes when the tensors are allocated in dedicated memory (4.5s vs 1.5s).
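One way to spot this situation from inside the training loop is to compare how much memory PyTorch’s caching allocator has reserved against the device’s reported (dedicated) total. This is a sketch of my own; the `near_limit` helper and the 0.9 headroom are illustrative choices, not anything PyTorch prescribes:

```python
import torch

def near_limit(reserved_bytes: int, total_bytes: int, headroom: float = 0.9) -> bool:
    """True once the caching allocator's reservations approach the GPU's
    dedicated capacity -- the point where the Windows driver may start
    spilling new allocations into (slow) shared system memory."""
    return reserved_bytes >= headroom * total_bytes

if torch.cuda.is_available():
    total = torch.cuda.get_device_properties(0).total_memory  # dedicated VRAM
    reserved = torch.cuda.memory_reserved(0)                  # cached by PyTorch
    print(f"reserved {reserved / 2**30:.2f} GiB of {total / 2**30:.2f} GiB, "
          f"near limit: {near_limit(reserved, total)}")
```

Note that `memory_reserved` only sees what PyTorch itself manages; tensors the driver has already spilled into shared memory are not reported separately, which is part of why the slowdown is hard to notice.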

A good way to prevent this is to tell the NVIDIA driver to restrict PyTorch to dedicated GPU memory only.

On Windows, there are two ways to do this:

  1. Prevent all programs from accessing shared memory. To do this, open NVIDIA Control Panel, go to Manage 3D settings, and on the Global Settings tab, change CUDA - Sysmem Fallback Policy from Driver Default to Prefer No Sysmem Fallback, then click the Apply button when it appears (it may take several seconds).
  2. Prevent only Python from accessing shared memory. To do this, open NVIDIA Control Panel, go to Manage 3D settings, click the Program Settings tab, click Add, add Python3.exe (or Python.exe, depending on your installation), and then follow the same instructions as above. NVIDIA also has a nice support article on how to do this, including screenshots: System Memory Fallback for Stable Diffusion | NVIDIA

The steps should be similar on Linux. See How to Install and Use the NVIDIA Control Panel on Linux for how to launch the NVIDIA Control Panel there.

After enabling this setting, your training should be restricted to using only your GPU’s dedicated memory!

I hope this helps someone :slight_smile:

My guess is that server-grade GPUs don’t have GPU shared memory, so this probably only affects consumer-grade GPUs. Even so, I think it might be useful to document this somewhere in the official documentation.

Or, ideally, PyTorch would be able to tell the difference between dedicated and shared GPU memory and would flush its cache whenever allocating a tensor would otherwise land in shared memory! :slight_smile:

I assume you mean system memory for offloading. If so, note that PyTorch does not trigger this behavior; it’s caused by the Windows driver as a “feature”.

No, since the Linux driver does not offload to the host. On Linux you could use the torch.cuda.MemPool API to e.g. use UVM to offload (sections) of your tensors but would also see the same poor performance for these sections of your code. Alternatively, you could use the native offloading APIs in PyTorch.
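I’m not certain which native offloading API is meant here; one candidate is torch.autograd.graph.save_on_cpu, which packs the activations saved for backward off to host memory and copies them back on demand. A minimal sketch (the tensor shape is arbitrary, and it runs on CPU-only builds too):

```python
import torch
from torch.autograd.graph import save_on_cpu

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(64, 64, device=device, requires_grad=True)

# Inside this context, tensors saved for backward (here, x) are offloaded
# to CPU memory -- explicit, opt-in offloading rather than the driver
# silently spilling allocations into shared memory.
with save_on_cpu(pin_memory=False):
    y = (x * x).sum()

y.backward()  # the gradient of sum(x^2) is 2x
```

The key difference from the driver’s sysmem fallback is that you choose which sections of the computation pay the host-transfer cost.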

PyTorch does not trigger this behavior; some Windows drivers do. I would assume this is already documented somewhere (I’m not using Windows, so I don’t know).

Yes, I know, thanks :slight_smile: Still, I think it might be worth looking into whether PyTorch could automatically detect the GPU physical memory limit and keep the cache beneath that limit when possible. If people agree, then I can file a bug on the PyTorch GitHub :slight_smile:

I couldn’t find anything in the PyTorch documentation, and I think this is an issue that essentially every PyTorch Windows user with a consumer-grade GPU will run into :slight_smile:

Ah, interesting! :slight_smile:

If your moderator powers permit, would you mind moving this to the Windows category, since it seems like it’s only valid on Windows?

No, I don’t think changing PyTorch’s default behavior makes sense since PyTorch itself will properly raise an OOM when running out of memory on the device. If you want to lower the available memory to your PyTorch process you can use torch.cuda.set_per_process_memory_fraction which should hopefully also avoid this behavior on your Windows system.
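A sketch of that suggestion; the 0.9 headroom is my own illustrative value, not something the API prescribes:

```python
import torch

def allocator_cap_bytes(total_bytes: int, fraction: float = 0.9) -> int:
    """Bytes the caching allocator may use after the per-process fraction
    is applied (set_per_process_memory_fraction scales the device's
    reported total by this fraction)."""
    return int(fraction * total_bytes)

if torch.cuda.is_available():
    # Cap PyTorch at ~90% of the reported (dedicated) memory so it raises
    # an OOM instead of letting the Windows driver spill into shared memory.
    torch.cuda.set_per_process_memory_fraction(0.9, device=0)
    total = torch.cuda.get_device_properties(0).total_memory
    print(f"allocator capped at ~{allocator_cap_bytes(total) / 2**30:.2f} GiB")
```

Allocations beyond the capped amount then fail with a regular CUDA out-of-memory error inside PyTorch rather than being silently offloaded by the driver.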

I don’t think all Windows drivers enable this behavior by default, but again, I don’t use Windows and don’t know how widespread it is. I believe this is the second time a user has mentioned it as an issue, so either Windows users expect this behavior or don’t consider it hostile (I do think it is unexpected).

This article from Nvidia says that it’s enabled in driver 536.40+, which was released in June 2023.