[Strange GPU memory behavior] Strange memory consumption and out of memory error

Hello, everyone. I am new to PyTorch and recently implemented CNN training for semantic segmentation. However, I am seeing some strange GPU memory behavior and cannot find the reason.

[1] Different memory consumption on different GPU platforms
I have 2 GPU platforms for model training.
Same code, same ResNet50 CNN, same training data, same batch size = 1
Platform 1 [GTX 970 (4 GB), CUDA 8.0, cuDNN 5.0, NVIDIA driver 375.26] consumes 3101 MB of GPU memory
Platform 2 [GTX TITAN X (12 GB), CUDA 7.5, cuDNN 5.0, NVIDIA driver 352.30] consumes 1792 MB of GPU memory
There is a big difference in GPU memory consumption, and I cannot find the reason. Can anyone help?
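
For reference, per-process GPU memory usage can be checked from the command line like this (a rough sketch; the exact output format may vary with driver version):

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv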

[2] Strange out of memory error when training
On platform 1, when there is only one image/label pair in the training data list, the model trains normally for 100 epochs (batch size = 1, thus 1 iteration per epoch). However, if I duplicate that pair so there are 2 image/label pairs in the training data list, training breaks down at the second iteration with the out of memory error shown in the screenshot below (batch size = 1, thus 2 iterations per epoch). This error does not occur on platform 2. So strange.


Has anyone met the same problem? Can anyone give some help?

THANKS!!!

I'm wondering if cudnn is choosing different algorithms for convolution on platform 1 and platform 2.

Try the following on platform 1:

torch.backends.cudnn.enabled = False

Also separately maybe try:

torch.backends.cudnn.benchmark = True
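
For example, a minimal sketch of where these flags would go, right after the imports and before building the model (model, optimizer, and data loader details omitted):

import torch
import torch.backends.cudnn as cudnn

cudnn.enabled = False     # first experiment: bypass cudnn entirely
# cudnn.benchmark = True  # second experiment: let cudnn pick the fastest algorithm

# ... build the model, optimizer, and data loader, then train as usual ...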

Thanks very much, @smth. That (cudnn) is exactly the reason for the different memory consumption on the 2 platforms.
On platform 1, setting torch.backends.cudnn.enabled = True did not help: the memory consumption is still 3101M. On platform 2, however, it is 1792M with torch.backends.cudnn.enabled = True and 3100M with torch.backends.cudnn.enabled = False. So there may be something wrong with cudnn on platform 1.
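
One quick sanity check I can run on platform 1 (a rough sketch using the standard torch.backends.cudnn queries) to confirm whether cuDNN is actually being detected:

import torch
import torch.backends.cudnn as cudnn

print(torch.cuda.is_available())  # is CUDA itself usable
print(cudnn.version())            # linked cuDNN version, if cuDNN was found
print(cudnn.enabled)              # current value of the flag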

And what do you think of the second problem, the strange out of memory behavior? cudnn.enabled and cudnn.benchmark do not seem to be the reason.

I met a similar problem.
It was caused by torch.backends.cudnn.benchmark = True; when I changed it to torch.backends.cudnn.enabled = True, everything was OK.

Hi, when I do
cudnn.enabled = True
cudnn.benchmark = True
it raises "RuntimeError: CUDNN_STATUS_INTERNAL_ERROR".
Do you know how to resolve it?

Sorry, I have not met this error before. Could you please post more of the error message?

More error info:
Traceback (most recent call last):
  File "main_test.py", line 14, in <module>
    t.test()
  File "code/test.py", line 64, in test
    output = self.model(input)
  File "/home/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "code/model/model.py", line 46, in forward
    x = self.headConv(x)
  File "/home/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/.local/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 254, in forward
    self.padding, self.dilation, self.groups)
  File "/home/.local/lib/python2.7/site-packages/torch/nn/functional.py", line 52, in conv2d
    return f(input, weight, bias)
RuntimeError: CUDNN_STATUS_INTERNAL_ERROR

It seems that the error occurs in the Conv module. If it happens only when cudnn is enabled, there may be something wrong with the cudnn library: a wrong version or something else.
Sorry, I cannot figure out the exact reason. :smiley:

My pytorch version is 0.2.0, installed with pip following the official guide. I also suspect it is an environment problem, but I don't know how to resolve it.

Sorry, I usually build pytorch from source and do not know what's wrong with the pytorch you installed.
Maybe you can check whether your pytorch links all libraries correctly:

cd /usr/local/lib/python2.7/dist-packages/torch
ldd _C.so

Enter the installation path and check the library links.
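
To narrow the output down to just the CUDA/cuDNN libraries, something like this should also work:

cd /usr/local/lib/python2.7/dist-packages/torch
ldd _C.so | grep -i -E "cudnn|cudart|cublas"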

My library links look like this:
linux-vdso.so.1 => (0x00007ffd291d5000)
libshm.so => /usr/local/lib/python2.7/dist-packages/torch/./lib/libshm.so (0x00007fe5e1c80000)
libcudart-5d6d23a3.so.8.0.61 => /usr/local/lib/python2.7/dist-packages/torch/./lib/libcudart-5d6d23a3.so.8.0.61 (0x00007fe5e1a18000)
libnvToolsExt-422e3301.so.1.0.0 => /usr/local/lib/python2.7/dist-packages/torch/./lib/libnvToolsExt-422e3301.so.1.0.0 (0x00007fe5e180e000)
libcudnn-3f9a723f.so.6.0.21 => /usr/local/lib/python2.7/dist-packages/torch/./lib/libcudnn-3f9a723f.so.6.0.21 (0x00007fe5d82aa000)
libTH.so.1 => /usr/local/lib/python2.7/dist-packages/torch/./lib/libTH.so.1 (0x00007fe5d5c0e000)
libTHS.so.1 => /usr/local/lib/python2.7/dist-packages/torch/./lib/libTHS.so.1 (0x00007fe5d59db000)
libTHPP.so.1 => /usr/local/lib/python2.7/dist-packages/torch/./lib/libTHPP.so.1 (0x00007fe5d543f000)
libTHNN.so.1 => /usr/local/lib/python2.7/dist-packages/torch/./lib/libTHNN.so.1 (0x00007fe5d511b000)
libATen.so.1 => /usr/local/lib/python2.7/dist-packages/torch/./lib/libATen.so.1 (0x00007fe5d47d6000)
libTHC.so.1 => /usr/local/lib/python2.7/dist-packages/torch/./lib/libTHC.so.1 (0x00007fe5c3632000)
libTHCS.so.1 => /usr/local/lib/python2.7/dist-packages/torch/./lib/libTHCS.so.1 (0x00007fe5c31f4000)
libTHCUNN.so.1 => /usr/local/lib/python2.7/dist-packages/torch/./lib/libTHCUNN.so.1 (0x00007fe5bedf8000)
libnccl.so.1 => /usr/local/lib/python2.7/dist-packages/torch/./lib/libnccl.so.1 (0x00007fe5bc11d000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fe5bbf07000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fe5bbcea000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe5bb920000)
/lib64/ld-linux-x86-64.so.2 (0x00007fe5e3740000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fe5bb718000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fe5bb40f000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fe5bb20b000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fe5bae89000)
libgomp-ae56ecdc.so.1.0.0 => /usr/local/lib/python2.7/dist-packages/torch/./lib/libgomp-ae56ecdc.so.1.0.0 (0x00007fe5bac72000)
libcublas-e78c880d.so.8.0.88 => /usr/local/lib/python2.7/dist-packages/torch/./lib/libcublas-e78c880d.so.8.0.88 (0x00007fe5b7c2a000)
libcurand-3d68c345.so.8.0.61 => /usr/local/lib/python2.7/dist-packages/torch/./lib/libcurand-3d68c345.so.8.0.61 (0x00007fe5b3cb3000)
libcusparse-94011b8d.so.8.0.61 => /usr/local/lib/python2.7/dist-packages/torch/./lib/libcusparse-94011b8d.so.8.0.61 (0x00007fe5b1193000)

It looks like all libraries are linked correctly. Maybe you can create a new topic on the pytorch forums to get help from others. I am sorry that I cannot figure out the reason for this error.

Thank you all the same!

I am facing a similar issue, where the GPU memory consumed by the network is 15 GB on platform 1 and 11 GB on platform 2.

Platform 1: NVIDIA Tesla P100 GPU 16 GB, CUDA 9.1.85, PyTorch 0.4.0
Platform 2: GTX 1080 GPU 12 GB, CUDA 8.0.61, PyTorch 0.4.0

Both print the cudnn version as 7102 inside Python via torch.backends.cudnn.version().

Strangely, when I set torch.backends.cudnn.enabled = False on platform 1, my GPU consumption drops to 11 GB. But I thought GPU consumption should be lower with cudnn enabled, as suggested in the answers above. Can anyone help me figure out what the problem could be?

I guess cudnn will benchmark different implementations and select the one with the best performance/fastest speed, and that choice can differ between GPU platforms.
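
A rough way to see this effect yourself (a sketch assuming PyTorch 0.4+, where torch.cuda.max_memory_allocated() is available; the conv layer and input size below are arbitrary): run the script twice, once with benchmark = True and once with False, and compare the peak allocation.

import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn

cudnn.benchmark = True   # flip to False on the second run and compare

# arbitrary conv layer and input, just to trigger cudnn algorithm selection
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(1, 64, 512, 512, device='cuda')

out = conv(x)
out.sum().backward()
torch.cuda.synchronize()

# peak memory allocated through PyTorch's allocator (workspace included);
# nvidia-smi will report more because of the CUDA context and cached blocks
print(torch.cuda.max_memory_allocated() / 1024**2, 'MB peak')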