How to explain huge GPU RAM usage?


(Doms) #1

Hello,

Latest PyTorch, CUDA 9, and cuDNN 7, installed using conda on Linux (Lubuntu 16.04, latest NVIDIA driver).

I am running this code on a computer with 2x GTX 1080 Ti:

Previously, I was running the ImageNet example from the PyTorch examples.

For both of them, the GPU RAM needed to run the training seems high compared to the same kind of training with Caffe or TensorFlow, and I found no way to keep it from being greedy with the RAM.
For instance, on the pose estimation, there is a burst of RAM either at the end of an epoch or at the beginning of the evaluation step, leading to a CUDA out-of-memory error. With 11 GB of RAM on each GPU, I can only run the training with batch size = 32, even though the script uses only half of the available RAM within an epoch.

For the ImageNet example, same problem: I had to fix the batch size to 80 to be able to train a VGG19_bn. Another problem: when restarting from a checkpoint, the script dies at the end of each epoch with a CUDA out-of-memory error, even with the same batch size.

Can anyone help me understand this burst of RAM?

Thank you.


(Doms) #2

In case it helps, here are the package versions from my conda environment:

Name Version Build Channel

appdirs 1.4.3 py37h28b3542_0
asn1crypto 0.24.0 py37_0
attrs 18.2.0 py37h28b3542_0
automat 0.7.0 py37_0
backcall 0.1.0 py37_0
blas 1.0 mkl
bleach 2.1.4 py37_0
bzip2 1.0.6 h14c3975_5
ca-certificates 2018.03.07 0
cairo 1.14.12 h8948797_3
certifi 2018.8.24 py37_1
cffi 1.11.5 py37h9745a5d_0
chardet 3.0.4 py37_1
conda 4.5.11 py37_0
conda-env 2.6.0 1
constantly 15.1.0 py37h28b3542_0
cryptography 2.3.1 py37hc365091_0
cudatoolkit 9.0 h13b8566_0
cudnn 7.1.2 cuda9.0_0
cycler 0.10.0 py37_0
cython 0.28.5 py37hf484d3e_0
dbus 1.13.2 h714fa37_1
decorator 4.3.0 py37_0
easydict 1.8
entrypoints 0.2.3 py37_2
expat 2.2.5 he0dffb1_0
ffmpeg 4.0 hcdf2ecd_0
fontconfig 2.13.0 h9420a91_0
freeglut 3.0.0 hf484d3e_5
freetype 2.9.1 h8a8886c_0
future 0.16.0
glib 2.56.1 h000015b_0
gmp 6.1.2 h6c8ec71_1
graphite2 1.3.12 h23475e2_2
gst-plugins-base 1.14.0 hbbd80ab_1
gstreamer 1.14.0 hb453b48_1
harfbuzz 1.8.8 hffaf4a1_0
hdf5 1.10.2 hba1933b_1
html5lib 1.0.1 py37_0
hyperlink 18.0.0 py37_0
icu 58.2 h9c2bf20_1
idna 2.7 py37_0
incremental 17.5.0 py37_0
intel-openmp 2018.0.3 0
ipykernel 4.9.0 py37_0
ipython 6.5.0 py37_0
ipython_genutils 0.2.0 py37_0
ipywidgets 7.4.1 py37_0
jasper 2.0.14 h07fcdf6_1
jedi 0.12.1 py37_0
jinja2 2.10 py37_0
jpeg 9b h024ee3a_2
jsonschema 2.6.0 py37_0
jupyter 1.0.0 py37_5
jupyter_client 5.2.3 py37_0
jupyter_console 5.2.0 py37_1
jupyter_core 4.4.0 py37_0
kiwisolver 1.0.1 py37hf484d3e_0
libedit 3.1.20170329 h6b74fdf_2
libffi 3.2.1 hd88cf55_4
libgcc-ng 8.2.0 hdf63c60_1
libgfortran-ng 7.2.0 hdf63c60_3
libglu 9.0.0 hf484d3e_1
libopencv 3.4.2 hb342d67_1
libopus 1.2.1 hb9ed12e_0
libpng 1.6.34 hb9fc6fc_0
libsodium 1.0.16 h1bed415_0
libstdcxx-ng 8.2.0 hdf63c60_1
libtiff 4.0.9 he85c1e1_1
libuuid 1.0.3 h1bed415_2
libvpx 1.7.0 h439df22_0
libxcb 1.13 h1bed415_1
libxml2 2.9.8 h26e45fe_1
markupsafe 1.0 py37h14c3975_1
matplotlib 2.2.3 py37hb69df0a_0
mistune 0.8.3 py37h14c3975_1
mkl 2018.0.3 1
mkl_fft 1.0.4 py37h4414c95_1
mkl_random 1.0.1 py37h4414c95_1
nbconvert 5.3.1 py37_0
nbformat 4.4.0 py37_0
nccl 1.3.5 cuda9.0_0
ncurses 6.1 hf484d3e_0
ninja 1.8.2 py37h6bb024c_1
notebook 5.6.0 py37_0
numpy 1.15.1 py37h1d66e8a_0
numpy-base 1.15.1 py37h81de0dd_0
olefile 0.45.1 py37_0
opencv 3.4.2 py37h6fd60c2_1
openssl 1.0.2p h14c3975_0
pandoc 2.2.3.2 0
pandocfilters 1.4.2 py37_1
parso 0.3.1 py37_0
pcre 8.42 h439df22_0
pexpect 4.6.0 py37_0
pickleshare 0.7.4 py37_0
pillow 5.2.0 py37heded4f4_0
pip 10.0.1 py37_0
pixman 0.34.0 hceecf20_3
prometheus_client 0.3.1 py37h28b3542_0
prompt_toolkit 1.0.15 py37_0
protobuf 3.6.1
ptyprocess 0.6.0 py37_0
py-opencv 3.4.2 py37hb342d67_1
pyasn1 0.4.4 py37h28b3542_0
pyasn1-modules 0.2.2 py37_0
pycocotools 2.0.0
pycosat 0.6.3 py37h14c3975_0
pycparser 2.18 py37_1
pygments 2.2.0 py37_0
PyHamcrest 1.9.0
pyopenssl 18.0.0 py37_0
pyparsing 2.2.0 py37_1
pyqt 5.9.2 py37h22d08a2_0
pysocks 1.6.8 py37_0
python 3.7.0 hc3d631a_0
python-dateutil 2.7.3 py37_0
pytorch 0.4.1 py37ha74772b_0
pytz 2018.5 py37_0
pyyaml 3.13 py37h14c3975_0
pyzmq 17.1.2 py37h14c3975_0
qt 5.9.6 h52aff34_0
qtconsole 4.4.1 py37_0
readline 7.0 ha6073c6_4
requests 2.19.1 py37_0
ruamel_yaml 0.15.46 py37h14c3975_0
scipy 1.1.0 py37hfa4b5c9_1
send2trash 1.5.0 py37_0
service_identity 17.0.0 py37h28b3542_0
setuptools 40.0.0 py37_0
simplegeneric 0.8.1 py37_2
sip 4.19.8 py37hf484d3e_0
six 1.11.0 py37_1
sqlite 3.24.0 h84994c4_0
tensorboardX 1.4
terminado 0.8.1 py37_1
testpath 0.3.1 py37_0
tk 8.6.7 hc745277_3
torchvision 0.2.1 py37_1 pytorch
tornado 5.1 py37h14c3975_0
traitlets 4.3.2 py37_0
twisted 18.7.0 py37h14c3975_1
urllib3 1.23 py37_0
wcwidth 0.1.7 py37_0
webencodings 0.5.1 py37_1
wheel 0.31.1 py37_0
widgetsnbextension 3.4.1 py37_0
xz 5.2.4 h14c3975_4
yaml 0.1.7 had09818_2
zeromq 4.2.5 hf484d3e_1
zlib 1.2.11 ha838bed_2
zope 1.0 py37_1
zope.interface 4.5.0 py37h14c3975_0


(Adrián Javaloy) #3

My guesses are:

  • You don’t evaluate under a with torch.no_grad() context.
  • You are storing the loss tensors instead of their values (that is, you are not calling the item() method of the tensor), which keeps the whole computation graph in memory.

Check those two first and see if that’s the problem. At the moment I don’t have time to look at your code.
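As a minimal sketch of both fixes (using a toy nn.Linear model as a stand-in for your actual network and loss):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)            # stand-in for the real network
criterion = nn.CrossEntropyLoss()
data = torch.randn(4, 10)
target = torch.tensor([0, 1, 0, 1])

# Fix 1: run evaluation inside torch.no_grad() so no autograd
# graph is built (and therefore no extra memory is held).
model.eval()
with torch.no_grad():
    val_loss = criterion(model(data), target)

# Fix 2: accumulate loss.item() (a plain Python float), not the
# loss tensor itself; storing the tensor keeps the whole
# computation graph alive across iterations.
model.train()
running_loss = 0.0
loss = criterion(model(data), target)
loss.backward()
running_loss += loss.item()         # not: running_loss += loss
```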


(Doms) #4

Thank you for your answer.

I already tested the loss hypothesis. I did not test with torch.no_grad(). The training is currently running (but it will fail, as the memory fills up more and more…).

If someone has some other hypothesis about this problem, they are welcome to submit it before the next training.

Thank you.


(Doms) #5

Sorry for my late answer. So, to sum up:

  • I did not manage to make it work with torch.no_grad(). Actually, I want to compare with other runs, so I dropped this idea.
  • I tried adding some del statements on tensors (as suggested in the PyTorch documentation), but nothing changed. The surprising thing is that PyTorch starts with half of the GPU memory and ends up filling it over several epochs, yet I get no CUDA errors while running the training (at the end, the memory of both GPUs is almost full). I still do not understand what is happening… Is there any way to track all memory blocks on the GPU and where they were allocated?
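Regarding tracking memory: torch.cuda exposes per-device counters for memory that PyTorch's caching allocator has handed out (they do not record allocation sites, but they do show where in the training loop usage grows). A small helper along these lines, called at the suspected burst points, can narrow things down — log_gpu_memory is a hypothetical name, not a PyTorch function:

```python
import torch

def log_gpu_memory(tag=""):
    """Print allocated and peak CUDA memory for each device."""
    if not torch.cuda.is_available():
        print(f"{tag}: CUDA not available")
        return
    for d in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(d) / 1024 ** 2
        peak = torch.cuda.max_memory_allocated(d) / 1024 ** 2
        print(f"{tag} device {d}: {alloc:.1f} MiB allocated "
              f"(peak {peak:.1f} MiB)")

# Call at the points where the burst is suspected, e.g.:
log_gpu_memory("end of epoch")
log_gpu_memory("start of eval")
```

Note that nvidia-smi will usually report more than these counters, because the caching allocator holds on to freed blocks for reuse instead of returning them to the driver.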

Thank you.

PS: anyway, my run went fine; it is just a pity that with 2x 1080 Ti GPUs, I cannot run with a bigger batch size…