CUDA Out of Memory Error

I am getting this error:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.60 GiB (GPU 0; 23.65 GiB total capacity; 14.12 GiB already allocated; 3.97 GiB free; 18.15 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
```

I have `NVIDIA-SMI 470.199.02, Driver Version: 470.199.02, CUDA Version: 11.4` in my base environment.

I have installed torch in my environment using:

```
pip install torch==1.13.0+cu116 torchvision==0.14.0+cu116 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu116
```

I also tried installing torch==1.12.0 using:

```
pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu113
```

but I still get the same error.

Can somebody please help me by suggesting how to solve this error?

Any reason for using such an outdated version of PyTorch? Try upgrading to a more recent version of the libs. There is also an environment variable, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, that can help if your tensor sizes vary from batch to batch. Other than that, it is a bit difficult to say based on the information you have provided. What makes you think that you are not genuinely running out of memory?
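For example, a minimal sketch of setting that variable (note that `expandable_segments` only exists in more recent PyTorch releases, so this also assumes you have upgraded; the tensor below is just a placeholder allocation):

```python
import os

# Must be set before the first CUDA allocation,
# so do it before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

# The allocator now grows segments instead of fragmenting
# when allocation sizes change between batches.
x = torch.randn(1024, 1024, device="cuda")
```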

Thanks for the reply. I am combining two different models. One requires PyTorch 1.0.1, and the other might also run on the latest torch. Taking the CUDA version I had into consideration, I chose a middle ground. I don't really know how right this is, but my thinking was that since I have CUDA version 11.4 in my base environment, the newer PyTorch versions might not work as expected, which could be causing the error. Please correct me if I am wrong, and please let me know what additional information you need so that you can help me solve the error.

The first thing to try is to run the model on a tiny data sample: minibatch size of 1, and any other variable parameters of the data set to minimum values. Again, it is difficult to say more, because I do not know what kind of model it is and what kind of data it is expecting (images, text, …). Run inference separately, with gradients disabled. If that works, then run with the gradients.
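Something along these lines (a sketch with a stand-in model; substitute your own model and input shape):

```python
import torch
import torch.nn as nn

# Stand-in model; replace with your own (e.g. loaded from a checkpoint).
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Flatten(), nn.LazyLinear(10))
model.eval().cuda()

# A single sample (minibatch size 1); adjust the shape to your data.
sample = torch.randn(1, 3, 224, 224, device="cuda")

with torch.no_grad():  # no autograd graph, so activations are freed immediately
    out = model(sample)
print(out.shape)
```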

If working with an old model, I would first try to run everything on the old version (v1.0.1 in your case) to make sure that the older version works. Alternatively, try to port everything to the more recent versions of all the libs, including CUDA drivers, assuming your hardware permits it. Generally, the more recent versions should (hopefully) have more bugs ironed out and be better optimised. With the "middle ground" approach you might be getting the worst of both worlds.

How much memory does your GPU physically have? Are there any other applications using it? How much memory is consumed by the model checkpoint, by the gradients, and by the data?

Here are some APIs to check the memory usage.
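For example (the model here is a stand-in, and the parameter count is only a rough lower bound on checkpoint size):

```python
import torch
import torch.nn as nn

model = nn.Linear(4096, 4096).cuda()  # stand-in model; replace with yours

# Rough size of the weights alone; gradients roughly double this during training.
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"parameters: {param_bytes / 1024**2:.1f} MiB")

# What the caching allocator has actually handed out vs. what it holds in reserve.
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")

# Full breakdown, useful for spotting fragmentation.
print(torch.cuda.memory_summary())
```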

Good luck :slight_smile:

I have two models: `model` and `model1`. `model` uses batch size 1, and `model1` is used to compute the distance between images. I can make `model1` run on CPU by setting `torch.device('cpu')` (please correct me if I am doing anything wrong here), but I still get the same error. Both models perform their operations as expected, but at the end it throws up the error. I have Python 3.8.19 with torch 1.12.1 in my environment. If I just run `model` with `model1` commented out, it runs without any error, so I think the problem is with `model1`. The problem with installing torch 1.0.1 is that my new `model1`'s dependencies require torch>=1.13.0, so I can't install torch==1.0.1.

I have 23.65 GiB of memory on the GPU. There are no other processes/applications using memory, as checked with nvidia-smi.

Why does PyTorch reserve memory? Why can't it use all of the available memory? Is there any way to increase the reserved memory? Does the reserved memory vary with the version of torch or CUDA?

Please help me with this problem, as I have been stuck with it for a long time and I am a newbie, so I don't have knowledge about these things. Your help is highly appreciated.
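Here is a minimal sketch of how I am moving `model1` to the CPU (the model and tensors below are simplified stand-ins for my actual code):

```python
import torch
import torch.nn as nn

cpu = torch.device("cpu")

model1 = nn.Linear(512, 512)  # stand-in for my distance model
model1 = model1.to(cpu)       # moving the module alone is not enough...

# Outputs from the first model live on the GPU, so I move them
# to the CPU before passing them to model1.
a = torch.randn(1, 512, device="cuda")
b = torch.randn(1, 512, device="cuda")

dist = model1(a.to(cpu)) - model1(b.to(cpu))
print(dist.norm())
```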