Sudden surge in CPU RAM usage after upgrading PyTorch v1.6 to v1.9

Amithpn · May 31, 2023, 5:45am

Hi,

I am noticing a ~3 GB increase in CPU RAM usage after the first .cuda() call. I recently upgraded PyTorch from v1.6 to v1.9.0+cu111. Since the upgrade, I see an increase in RAM utilization of ~3 GB when I load the model.

I’ve noticed this behavior on my PowerEdge server:

OS: Ubuntu 20.04.4 LTS
Processor: Intel® Xeon® Gold 6338N CPU @ 2.20GHz, 32 Cores
GPU: NVIDIA A2
PyTorch: 1.9.0+cu111

Below is a table describing the issue I’m facing:

torch version	CPU memory consumption	GPU memory consumption
1.6	1.2 GB per process	0.9 GB per process
1.9	3.1 GB per process	1.7 GB per process

As a result, other processes are affected because less RAM is available. Can this be addressed? Is there a solution or workaround?
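To make the per-process numbers above reproducible, the host-side growth around the first CUDA call can be measured directly. This is only a minimal sketch, not the measurement used in the thread: the helper names `peak_rss_mb`, `rss_growth_mb`, and `first_cuda_call` are hypothetical, it assumes Linux (where `ru_maxrss` is reported in KB), and it tracks the peak resident set size rather than the current one.

```python
import resource


def peak_rss_mb() -> float:
    """Peak resident set size of this process in MB (ru_maxrss is KB on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


def rss_growth_mb(fn) -> float:
    """Run fn() and return how much the peak host RSS grew while it ran."""
    before = peak_rss_mb()
    fn()
    return peak_rss_mb() - before


def first_cuda_call() -> None:
    """Trigger CUDA context creation by moving a tiny tensor to the GPU."""
    import torch  # imported lazily so the helpers work without PyTorch

    if torch.cuda.is_available():
        torch.zeros(1).cuda()
        torch.cuda.synchronize()


if __name__ == "__main__":
    try:
        # Import first so the cost of `import torch` is not counted below.
        import torch  # noqa: F401
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        growth = rss_growth_mb(first_cuda_call)
        print(f"peak host RSS grew by {growth:.0f} MB on the first CUDA call")
```

Running the same script under each PyTorch version gives comparable numbers that do not depend on the model itself.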

ptrblck · May 31, 2023, 6:05am

Both PyTorch releases you used are old by now, so update to the latest stable or nightly release and check whether it’s still the case. For example, CUDA’s lazy loading has since been enabled, which should reduce the device memory usage for CUDA >=11.7 and additionally the host memory usage for CUDA >=11.8.
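For reference, CUDA's lazy loading is controlled by the `CUDA_MODULE_LOADING` environment variable. Recent PyTorch builds are understood to request it by default, so the sketch below should only matter when the variable has to be set explicitly. The helper name `request_lazy_cuda_loading` is hypothetical, and the setting is assumed to have no effect on CUDA runtimes older than 11.7.

```python
import os


def request_lazy_cuda_loading(env=os.environ) -> str:
    """Set CUDA_MODULE_LOADING=LAZY unless a value was already chosen.

    Returns the value that is in effect afterwards.
    """
    return env.setdefault("CUDA_MODULE_LOADING", "LAZY")


# This has to run before the first CUDA call in the process;
# once the CUDA context exists, changing the variable does nothing.
mode = request_lazy_cuda_loading()
print(f"CUDA_MODULE_LOADING={mode}")
```

Using `setdefault` keeps any value that was exported in the shell or the container, so an explicit `EAGER` setting is not silently overridden.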

Amithpn · May 31, 2023, 6:14am

I will try upgrading to the latest torch, but the Docker image size will also increase significantly. Are there any pointers for addressing that?

ptrblck · May 31, 2023, 7:06am

It’s unclear which docker image you are using and where the size increase is coming from, so I don’t know if you could address it.

Amithpn · May 31, 2023, 8:07am

I use Ubuntu 20.04 as the base image to package the model as a service for inference. When I updated from torch 1.6 to 1.9, the image size grew from 7 GB to 14 GB.

Amithpn · June 2, 2023, 7:56am

Hi @ptrblck, I upgraded to torch 1.13.1 using the command below:

python3.7 -m pip install torch==1.13.1+cu117 torchvision==0.14.1 -f https://download.pytorch.org/whl/torch_stable.html

I see the exception below when I try to import torchvision.

root@a761eb87f45e:/var/log/supervisor# python3
Python 3.7.7 (default, May 7 2020, 21:25:33)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torchvision
/opt/conda/lib/python3.7/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libc10_hip.so: cannot open shared object file: No such file or directory
warn(f"Failed to load image Python extension: {e}")

I found the compatible versions from this link:

Any idea what is causing the issue?

ptrblck · June 2, 2023, 8:39am

You are not running into an exception, but a warning in torchvision. Also, 1.13.1 is not the latest release.
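One possible explanation for the warning, not confirmed in this thread: `libc10_hip.so` belongs to ROCm builds, so pip may have picked a ROCm build of torchvision because the install command pinned `+cu117` for torch but gave no build tag for torchvision. The sketch below compares the build tags of the two installed packages. The helper names `local_build_tag` and `same_build` are hypothetical, and a mismatch is only a hint, since wheels from PyPI carry no tag at all.

```python
def local_build_tag(version: str) -> str:
    """Return the part after '+' in a version string, or '' if there is none."""
    return version.partition("+")[2]


def same_build(torch_version: str, torchvision_version: str) -> bool:
    """True when both version strings carry the same build tag (e.g. cu117)."""
    return local_build_tag(torch_version) == local_build_tag(torchvision_version)


if __name__ == "__main__":
    # importlib.metadata needs Python 3.8+; on 3.7 the importlib_metadata
    # backport provides the same functions.
    from importlib.metadata import PackageNotFoundError, version

    try:
        tv, vv = version("torch"), version("torchvision")
    except PackageNotFoundError:
        print("torch and/or torchvision is not installed")
    else:
        status = "match" if same_build(tv, vv) else "MISMATCH"
        print(f"torch {tv}, torchvision {vv}: build tags {status}")
```

If the tags differ, requesting the matching build explicitly (for example `torchvision==0.14.1+cu117` in the install command) would be the first thing to try.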

Amithpn · July 10, 2023, 5:59am

Hi @ptrblck, I updated to the latest torch v2.0.1 with CUDA 11.7. I see some improvements in GPU utilization and RAM usage. What are the improvements in CUDA 11.8? Do you advise updating to CUDA 11.8?

ptrblck · July 10, 2023, 7:03am

Yes, since lazy module loading was added in 11.7 and improved in 11.8: