MPS backend out of memory

Hello everyone, I am trying to run a CNN, using MPS on a MacBook Pro M2. After roughly 28 training epochs I get the following error:

RuntimeError: MPS backend out of memory (MPS allocated: 327.65 MB, other allocations: 8.51 GB, max allowed: 9.07 GB). Tried to allocate 240.25 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

I set PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7 without really knowing what I was doing, just to see what happens. It turns out that this time I get the same error message after just 1 epoch of training the same model.
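For anyone else experimenting with this: as far as I can tell, the variable has to be in the environment before PyTorch initializes the MPS allocator, so either export it in your shell or set it at the very top of your script (a minimal sketch, the `0.7` value is just the one from this thread):

```python
import os

# PYTORCH_MPS_HIGH_WATERMARK_RATIO must be set before the MPS allocator
# starts up, i.e. before `import torch` — setting it later has no effect.
# "0.0" removes the upper limit entirely (the error message warns this
# may cause system failure); values between 0 and 1 lower the cap.
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.7"

# import torch  # safe to import only after the variable is set
print(os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"])  # → 0.7
```

Note that lowering the ratio below 1.0 shrinks the cap, which matches the observation above that 0.7 makes the OOM appear sooner, not later.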

I cannot find anything about this error, other than in some issues of a stable diffusion github repo. Maybe any of you can point me in the right direction?

PS: I hesitate to set PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 as it might cause system failure.

Thanks for your help! :slight_smile:


Same issue here, I can't find any info on how to fix it.


By the way, I figured it out @Hendrik_S. Here are some details:

  • First of all, I don't know much about GPUs.
  • Practically, what I did was reduce the image size enough that there were no more memory problems.
  • You can also reduce the batch size, for the same purpose.
  • You can also change PYTORCH_MPS_HIGH_WATERMARK_RATIO to something like 0.9 (I don't remember the exact value, but you can look it up). However, I wouldn't touch this setting.
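To add a back-of-the-envelope sketch of why those two knobs help (the layer sizes below are made up for illustration): activation memory scales linearly with batch size but quadratically with image side length, so halving the image size cuts a conv layer's activations by 4×, while halving the batch only cuts them by 2×:

```python
def activation_bytes(batch, channels, height, width, dtype_bytes=4):
    """Memory for one float32 feature map, in bytes."""
    return batch * channels * height * width * dtype_bytes

# Hypothetical conv feature map: batch 64, 64 channels, 224x224 pixels.
full = activation_bytes(batch=64, channels=64, height=224, width=224)
half_img = activation_bytes(batch=64, channels=64, height=112, width=112)
half_batch = activation_bytes(batch=32, channels=64, height=224, width=224)

print(full // 2**20, half_img // 2**20, half_batch // 2**20)  # → 784 196 392
```

The same arithmetic applies to every layer in the network, which is why a modest reduction in input resolution can make a large difference to peak memory.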

Other details

Now I was under the impression this had to do with RAM, but the error points to the GPU (MPS), so I assumed that GPUs have their own memory. As far as I understand, on Apple Silicon the GPU actually shares unified memory with the CPU rather than having dedicated VRAM, which is why the "max allowed" in the error message is a fraction of system RAM.

This is a reply from ChatGPT, so take it with a grain of salt (it describes discrete GPUs, not Apple's unified memory):

Why does the GPU have its own RAM?
The GPU, or Graphics Processing Unit, has its own RAM, also known as VRAM (Video Random Access Memory), for several reasons. First, the GPU needs a large amount of memory to store the textures, images, and other data that it uses to create the images on your screen. Second, the GPU needs its own memory so that it can access this data quickly and efficiently, without having to rely on the slower main memory of the computer. Third, having its own memory allows the GPU to work independently of the rest of the computer, which can improve performance and reduce the load on the CPU.

Hey everyone,
I have a small app that uses torchaudio. In the previous version I used torchaudio 2.0.2 with MPS, and it was able to process an audio file longer than 30 minutes on both my 16 GB and 8 GB RAM machines. The results were not good, though.

I updated to torchaudio 2.1.0, and on my 16 GB RAM machine it runs really well and the results are far better than with 2.0.2. But on my 8 GB RAM machine I now get the above-mentioned OOM error:

MPS backend out of memory (MPS allocated: 1.45 GB, other allocations: 7.42 GB, max allowed: 9.07 GB). Tried to allocate 563.62 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

Why is there such a big difference in memory allocation between 2.0.2 and 2.1.0? Of course, since the results changed, a lot must have happened with MPS, but does anyone have a workaround for this?

When I check the RAM usage right after getting this error, it says only 2 GB of my system memory is in use, so there should be plenty left.
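Not an answer to the 2.0.2 vs. 2.1.0 difference, but one workaround that sometimes helps with long inputs is to release the MPS allocator's cache between processing chunks, since PyTorch can hold on to freed blocks that system monitors don't count as "in use". A hedged sketch (`torch.mps.empty_cache` is a real call in PyTorch 2.x; the helper wrapper is my own):

```python
import gc

def free_mps_cache():
    """Return cached MPS blocks to the system; no-op without torch/MPS."""
    try:
        import torch
    except ImportError:
        return False
    if torch.backends.mps.is_available():
        torch.mps.empty_cache()  # hand cached-but-unused memory back
        return True
    return False

# Drop dead Python references first, then shrink the allocator cache;
# call this between chunks of a long audio file, for example.
gc.collect()
free_mps_cache()
```

This only releases memory PyTorch has cached but is no longer using; if live tensors genuinely need more than the cap, reducing chunk size is still the way to go.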

Please, where did you set PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7?

Solved by setting PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 in my .zshrc.
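For anyone finding this later, the line in question would look like this (keeping in mind the warning from the error message that 0.0 removes the safety cap entirely and "may cause system failure"):

```shell
# ~/.zshrc — applies to every new shell; restart the terminal
# or run `source ~/.zshrc` after editing for it to take effect
export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0
```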

This suggests the bug is solved in 2.3.0dev. Does anyone know what was done to solve it, and whether we already get the fix in 2.2.1?