Sudden surge in CPU RAM usage after upgrading PyTorch from v1.6 to v1.9

Hi,

I am noticing a ~3 GB increase in CPU RAM usage after the first .cuda() call. I recently upgraded PyTorch from v1.6 to v1.9.0+cu111, and since the upgrade, RAM utilization is about 3 GB higher when I load the model.

I’ve noticed this behavior on my PowerEdge server:

OS: Ubuntu 20.04.4 LTS
Processor: Intel® Xeon® Gold 6338N CPU @ 2.20GHz, 32 Cores
GPU: NVIDIA A2
PyTorch: 1.9.0+cu111

Below is a tabular description of the issue I’m facing:

| torch version | CPU memory consumption | GPU memory consumption |
|---|---|---|
| 1.6 | 1.2 GB per process | 0.9 GB per process |
| 1.9 | 3.1 GB per process | 1.7 GB per process |

As a result, other processes are affected because less RAM is available. Can this be addressed? Is there a workaround?

Both PyTorch releases are old by now, so update to the latest stable or nightly release and check whether the issue persists. For example, CUDA’s lazy loading has since been enabled, which should reduce device memory usage for CUDA >= 11.7 and additionally host memory usage for CUDA >= 11.8.
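To compare releases, it can help to measure the host memory growth caused by the first .cuda() call directly. The following is a minimal sketch, not an official recipe: `rss_mb` and `host_memory_delta` are hypothetical helper names, the RSS reading assumes Linux, and `CUDA_MODULE_LOADING=LAZY` is the CUDA environment variable for lazy loading, which only takes effect on CUDA 11.7 or newer and has to be set before CUDA is initialized.

```python
import os
import resource


def rss_mb() -> float:
    """Return the resident set size of this process in MiB (Linux)."""
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) / 1024  # value is reported in kB
    except OSError:
        pass
    # Fallback: peak RSS instead of current RSS (ru_maxrss is in KiB on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def host_memory_delta(fn):
    """Call fn() and return (result, growth of host RSS in MiB)."""
    before = rss_mb()
    result = fn()
    return result, rss_mb() - before


if __name__ == "__main__":
    # Request lazy loading explicitly; older CUDA versions ignore this variable.
    os.environ.setdefault("CUDA_MODULE_LOADING", "LAZY")
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        # The first .cuda() call creates the CUDA context and loads kernels.
        _, delta = host_memory_delta(lambda: torch.zeros(1).cuda())
        print(f"torch {torch.__version__}, CUDA {torch.version.cuda}")
        print(f"host RSS grew by {delta:.0f} MiB after the first .cuda() call")
    else:
        print("PyTorch with CUDA is not available; nothing to measure.")
```

Running the same script in the old and the new environment gives per-process numbers that can be compared directly with the table above.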

I will try to upgrade to the latest torch. However, the Docker image size will also increase significantly. Any pointers on how to address that?

It’s unclear which docker image you are using and where the size increase is coming from, so I don’t know if you could address it.

I use Ubuntu 20.04 as a base image to package the model as a service for inference. When I updated from torch 1.6 to 1.9, the image size grew from 7 GB to 14 GB.

Hi @ptrblck, I upgraded to torch 1.13.1 using the command below:

python3.7 -m pip install torch==1.13.1+cu117 torchvision==0.14.1 -f https://download.pytorch.org/whl/torch_stable.html

I see the exception below when I try to import torchvision.

root@a761eb87f45e:/var/log/supervisor# python3
Python 3.7.7 (default, May 7 2020, 21:25:33)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torchvision
/opt/conda/lib/python3.7/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libc10_hip.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")

I found the compatible versions from this link:

Any idea what is causing the issue?

You are not running into an exception, but a warning in torchvision. Also, 1.13.1 is not the latest release.

Hi @ptrblck, I updated to the latest torch v2.0.1 with CUDA 11.7. I see some improvements in GPU memory and RAM usage. What are the improvements in CUDA 11.8? Do you advise updating to CUDA 11.8?

Yes, since lazy module loading was added in 11.7 and improved in 11.8: