Weird CUDA out of memory error when I decrease the input size

I was training my model using YOLOv3.
When I set my input size to 416, I can train my model with batch size = 9 without any errors.
However, when I decrease my input size to 320, I run into a CUDA out of memory error even with batch size = 7.
I find this particularly strange; has anyone encountered anything similar before?
The error happens when executing loss.backward().
Thank you in advance for helping me!

OK, this is getting even stranger.
Keeping my input size = 320, I changed my batch size from 9 to 16, and the model started training flawlessly without any errors.

Hi,

Do you use cudnn? Do you use it in benchmark mode? What are the memory usage values when it actually runs, and how much memory does your GPU have?

Hi albanD

  1. No, I did not explicitly use cudnn.
  2. I did not use it in benchmark mode.
  3. This was my RAM and GPU memory usage when I ran my training with input size = 320, batch size = 9 (a rough sketch of the logging code is at the end of this post):

[2018-12-12 13:27:27,480 train.py] ram memory info before loss.backward(): svmem(total=17138393088, available=5950562304, percent=65.3, used=11187830784, free=5950562304)
[2018-12-12 13:27:27,481 train.py] max_memory_allocated before loss.backward(): 2788280832
[2018-12-12 13:27:27,481 train.py] memory_cached before loss.backward(): 2799042560
[2018-12-12 13:27:27,482 train.py] max_memory_cached before loss.backward(): 2799042560
[2018-12-12 13:27:27,482 train.py] memory_allocated before loss.backward(): 2772128768

Errors came right after this:
line 70, in train
loss.backward()
File "C:\Users\jayde\Anaconda3\envs\new_env\lib\site-packages\torch\tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "C:\Users\jayde\Anaconda3\envs\new_env\lib\site-packages\torch\autograd\__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: out of memory

However, when I set input size = 320 and batch size = 16, this is what I get:

[2018-12-12 13:32:11,560 train.py] ram memory info before loss.backward(): svmem(total=17138393088, available=5892562944, percent=65.6, used=11245830144, free=5892562944)
[2018-12-12 13:32:11,561 train.py] max_memory_allocated before loss.backward(): 4837138432
[2018-12-12 13:32:11,561 train.py] memory_cached before loss.backward(): 5020450816
[2018-12-12 13:32:11,562 train.py] max_memory_cached before loss.backward(): 5020450816
[2018-12-12 13:32:11,562 train.py] memory_allocated before loss.backward(): 4684070912
[2018-12-12 13:32:11,749 train.py] ram memory info after loss.backward(): svmem(total=17138393088, available=5946978304, percent=65.3, used=11191414784, free=5946978304)
[2018-12-12 13:32:11,749 train.py] max_memory_allocated after loss.backward(): 5460978176
[2018-12-12 13:32:11,749 train.py] memory_cached after loss.backward(): 6174408704
[2018-12-12 13:32:11,750 train.py] max_memory_cached after loss.backward(): 6174408704
[2018-12-12 13:32:11,750 train.py] memory_allocated after loss.backward(): 518663680
[2018-12-12 13:32:12,162 train.py] ram memory info before loss.backward(): svmem(total=17138393088, available=5980577792, percent=65.1, used=11157815296, free=5980577792)
[2018-12-12 13:32:12,162 train.py] max_memory_allocated before loss.backward(): 5460978176
[2018-12-12 13:32:12,163 train.py] memory_cached before loss.backward(): 6193283072
[2018-12-12 13:32:12,163 train.py] max_memory_cached before loss.backward(): 6193283072
[2018-12-12 13:32:12,163 train.py] memory_allocated before loss.backward(): 5199972864
[2018-12-12 13:32:12,315 train.py] ram memory info after loss.backward(): svmem(total=17138393088, available=5965262848, percent=65.2, used=111731

No memory errors were raised.
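
For reference, the memory numbers above were collected roughly like this (a minimal sketch assuming psutil is installed; the actual train.py has more around it):

import logging
import psutil
import torch

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("train.py")

def log_memory(tag):
    # Host RAM as reported by psutil, plus PyTorch's CUDA allocator statistics
    # (all values in bytes, for the current device).
    logger.info("ram memory info %s: %s", tag, psutil.virtual_memory())
    logger.info("max_memory_allocated %s: %d", tag, torch.cuda.max_memory_allocated())
    logger.info("memory_cached %s: %d", tag, torch.cuda.memory_cached())
    logger.info("max_memory_cached %s: %d", tag, torch.cuda.max_memory_cached())
    logger.info("memory_allocated %s: %d", tag, torch.cuda.memory_allocated())

# Around the backward pass:
# log_memory("before loss.backward()")
# loss.backward()
# log_memory("after loss.backward()")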

Thank you again for helping me!

Hi,

Are you sharing the GPU with other people?
PyTorch reports less than 3GB used, but CUDA reports 11+GB in use on the device (in the case where it fails).
As you can see, in the case where it works with a bigger batch size, PyTorch actually uses much more memory: 5-6GB.

If there is a lot of memory already in use on the GPU even though nothing is running (you can check with nvidia-smi), you can use (sudo) lsof /dev/nvidiaX, where X is the number of the GPU you want to check. This will list all processes using the GPU more thoroughly than nvidia-smi. If there are processes that shouldn't be there, like old Python processes, you can get rid of them by killing them.
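
If it helps, here is a minimal sketch of querying the device's total and used memory from Python (assuming nvidia-smi is on the PATH):

import subprocess

# One CSV row per GPU: index, total memory, memory currently in use.
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,memory.total,memory.used", "--format=csv"]
).decode())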

Hi @albanD
Thank you so much for your reply!

No, I am not sharing the GPU with anyone.
The 11+GB is my local RAM, not GPU memory. I'm using a 2080, so it only has 8GB of VRAM.

The thing is, it fails every single time if I set input size = 320, batch size = 9. However, if I run with input size = 320, batch size = 16 right after that, it runs perfectly every single time.

I tried clearing the cache before backpropagation too, but the error still occurred.
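
This is roughly how I cleared the cache (a minimal sketch; the Linear model and loss here are just stand-ins for my actual YOLOv3 training loop):

import torch

model = torch.nn.Linear(8, 1).cuda()                  # stand-in model for illustration
loss = model(torch.randn(4, 8, device="cuda")).sum()  # stand-in loss

# Release cached, unused blocks back to the driver before the backward pass.
# Note: empty_cache() does not free memory held by live tensors.
torch.cuda.empty_cache()
loss.backward()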

Thank you again for helping me!

Hi,

What happens if you add torch.backends.cudnn.enabled = False at the beginning of your code? Does the error still occur?
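
For example, a minimal sketch of where the flag would go (before the model is built or any convolution runs):

import torch

# Fall back to PyTorch's native convolution kernels instead of cuDNN.
torch.backends.cudnn.enabled = False

# ... build the model and run the training loop as usual ...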


Hi @albanD

Yes, that solves the problem!!
Thank you so much for helping!
On the other hand, I have a couple of questions regarding this.

  1. I have never set torch.backends.cudnn.enabled = True, so is True the default setting in PyTorch?
  2. Why would setting it to True result in this error?
  3. In what situations would setting torch.backends.cudnn.enabled = True be beneficial?

Update:
So I did one more round of tests.
If I set
torch.backends.cudnn.enabled = True
and
torch.backends.cudnn.benchmark = True
no errors were raised either.

But if I set
torch.backends.cudnn.benchmark = False
then the error appeared again.
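
For completeness, the combinations I tested look roughly like this (a sketch; everything else in the training script stays the same):

import torch

# Tested combinations and outcomes on my setup (input size = 320, batch size = 9):
#   cudnn disabled entirely              -> no OOM error
#   cudnn enabled + benchmark = True     -> no OOM error
#   cudnn enabled + benchmark = False    -> OOM error at loss.backward()
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True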

Previously I was under the impression that the error was caused by benchmark mode, since it keeps searching for the best algorithm. Now I am even more confused.

Once again thank you so much for your help!!!

Hi,

Yes, I just remembered that there was a bug in some of the cudnn algorithm-selection handling.
Basically, cudnn chooses the fastest possible algorithm, but that can sometimes require more memory than you have. This should be handled properly by falling back to a less memory-hungry algorithm; there was a bug in this, fixed in this PR.
I'm not sure whether it made it into 1.0, but I think it did. Enabling benchmark mode should fix the issue and keep the best runtime!
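
As a sketch, enabling benchmark mode is a one-line change at the top of your script (most useful when the input size stays fixed within a run):

import torch

# cuDNN times the available convolution algorithms for the observed input
# shapes on the first iterations and then reuses the fastest one it found.
torch.backends.cudnn.benchmark = True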

Thank you so much @albanD!

Just one last question: do we need to install cudnn separately, or will it be automatically installed when we conda install torch?

Thank you again for helping me!

It comes with the conda install and is enabled by default.
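
If you want to double-check your install, a quick sketch (the printed values depend on your build):

import torch

print(torch.backends.cudnn.is_available())  # True if the binaries ship with cuDNN
print(torch.backends.cudnn.version())       # bundled cuDNN version, e.g. 7401
print(torch.backends.cudnn.enabled)         # True by default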
