Memory Management using PYTORCH_CUDA_ALLOC_CONF

Can I do anything about this? While training a model I am getting this CUDA error:

RuntimeError: CUDA out of memory. Tried to allocate 30.00 MiB (GPU 0; 2.00 GiB total capacity; 1.72 GiB already allocated; 0 bytes free; 1.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I reduced batch_size from 32 to 8. Can I do anything else with my 2 GB card? :stuck_out_tongue:

Hi @krishna511,

You can try changing the image size, the batch size, or even the model.

I suggest you try Google Colab (which is free) to train your model: with only 2 GB it is very challenging.
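As a rough sketch (the dataset path and the sizes below are placeholders, not your actual setup), reducing both the image size and the batch size in plain PyTorch could look like this:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Smaller images and a smaller batch both cut activation memory.
transform = transforms.Compose([
    transforms.Resize((128, 128)),  # e.g. down from 224x224
    transforms.ToTensor(),
])

# "path/to/data" is a placeholder for an ImageFolder-style dataset.
dataset = datasets.ImageFolder("path/to/data", transform=transform)
loader = DataLoader(dataset, batch_size=8, shuffle=True)  # down from 32
```

Switching to a smaller model (e.g. resnet18 instead of resnet50) helps for the same reason: fewer parameters and activations have to fit on the 2 GB card.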

Hi there, I'm new here and I hope I'm doing this right. I am getting that error in Google Colab and it suggests "See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF".

I'm not sure what the solution could be despite trying a number of things. I'm using the Stability AI diffusion colab and have a Pro account with Google. I'm using a single batch file and the default image size of 512. There is no reason I can think of for this error… other than maybe a cache that needs to be cleared, since I've been using the colab for most of the morning and changed my workflow several times.
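In case it helps, a generic way to inspect and release cached memory in a notebook (plain PyTorch calls, nothing specific to the Stability AI colab) is:

```python
import gc
import torch

# How much memory live tensors occupy vs. what the caching allocator holds.
print(torch.cuda.memory_allocated() / 1024**2, "MiB allocated")
print(torch.cuda.memory_reserved() / 1024**2, "MiB reserved")
print(torch.cuda.memory_summary())  # detailed per-pool breakdown

# Drop unreferenced Python objects from earlier runs, then release cached blocks.
gc.collect()
torch.cuda.empty_cache()
```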

Does it mean the script was working before and is now crashing without any changes?
The general error is of course raised because you are running out of memory on the GPU. Setting the allocator config to another value could help if a lot of fragmentation is happening, but I also don't know which values would then be recommended (I haven't been lucky enough to actually fix a valid OOM using it).
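For reference, the allocator config is passed through the PYTORCH_CUDA_ALLOC_CONF environment variable and has to be set before CUDA is initialized; the 128 below is only an illustrative value, not a recommendation:

```python
import os

# Equivalent to launching with:
#   PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python train.py
# Must be set before the first CUDA allocation (safest: before importing torch).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # noqa: E402

x = torch.randn(1024, 1024, device="cuda")  # allocator now uses the setting
```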

I have also met this problem. The training process could run successfully a few days ago. But after I saved the checkpoints (I don't know if that is the real reason), the GPU memory rises to an almost full state immediately every time I want to restart the training. It may be solved by cleaning the memory or device of this model, or by creating another file the same as before. However, so far I haven't found the best way to solve this problem.
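In case it's useful, a minimal sketch of "cleaning the memory of this model" between runs (the model and optimizer here are small placeholders standing in for the real training objects):

```python
import gc
import torch

# Placeholders standing in for whatever the training script keeps alive.
model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Drop every live reference to the old run; empty_cache() alone cannot free
# memory that is still reachable from Python objects.
del model, optimizer
gc.collect()
torch.cuda.empty_cache()

# When reloading a saved checkpoint, mapping it to the CPU first avoids an
# extra full-size copy landing on the already tight GPU:
# state = torch.load("checkpoint.pt", map_location="cpu")
```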

I think you just need to adjust the bs (batch size) parameter in your ImageDataLoaders to a smaller integer. The default is 64; try 16 or 32, for example (see the sketch below).
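A minimal fastai sketch, assuming the from_folder constructor and a standard folder layout (which may differ from your dataset):

```python
from fastai.vision.all import ImageDataLoaders, Resize

# bs defaults to 64; a smaller value keeps a small GPU from running out.
dls = ImageDataLoaders.from_folder(
    "path/to/images",       # placeholder dataset path
    bs=16,
    item_tfms=Resize(224),  # smaller item size also reduces memory pressure
)
```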