RAM/CPU memory leak with transforms

fnak · January 3, 2022, 7:39am

Hello,
I have been trying to debug an issue where my RAM fills up quickly when working with a dataset. It turns out this is caused by the transformations I apply to the images using torchvision transforms.
My code is very simple:

  for dir1 in os.listdir(img_folder):
      for file in os.listdir(os.path.join(img_folder, dir1)):
          image_path = os.path.join(img_folder, dir1, file)
          with Image.open(image_path) as img_pil:
              normalize = transforms.Normalize(mean=mean, std=std)
              preprocess = transforms.Compose([
                  transforms.Resize((img_size, img_size)),
                  transforms.ToTensor(),
                  normalize,
              ])
              img_pil = preprocess(img_pil)

Without the preprocessing step, memory is freed correctly as images are opened and closed.

I have tried defining the normalize and preprocess function outside the loop, but memory was still accumulating.

Am I missing something? Is there a way to free up the memory that is being occupied by the transformation steps?

NB: The same issue arises when using a DataLoader. I didn’t know what was causing it, which is how I ended up here.

Thanks

my3bikaht · January 3, 2022, 2:21pm

You can use batch transforms outside of the loading loop (it is probably much faster too).

I wonder if reassigning img_pil inside the with block is causing this issue for the PIL library.

fnak · January 3, 2022, 5:21pm

If I want to do batch transforms, I’ll have to open all images in memory, which would kinda lead to the same result. I am working with a very limited amount of RAM, and I want to open one image at a time, transform it, do some predictions, close it, and move on to the next.
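
A minimal sketch of the one-image-at-a-time pattern described here, assuming the same two-level layout (img_folder/<dir1>/<file>) as the original loop. The names iter_image_paths, process_one_at_a_time, and handle are hypothetical, not part of any library; the generator yields one path at a time so that nothing but the current image needs to be held in memory.

```python
import os
from typing import Callable, Iterator


def iter_image_paths(img_folder: str) -> Iterator[str]:
    """Lazily yield file paths from img_folder/<dir1>/<file>, one at a time."""
    with os.scandir(img_folder) as class_dirs:
        for class_dir in class_dirs:
            if not class_dir.is_dir():
                continue  # skip stray files at the top level
            with os.scandir(class_dir.path) as files:
                for entry in files:
                    if entry.is_file():
                        yield entry.path


def process_one_at_a_time(img_folder: str, handle: Callable[[str], None]) -> int:
    """Call handle(path) for each image in turn and return how many were seen."""
    count = 0
    for path in iter_image_paths(img_folder):
        handle(path)  # e.g. open, transform, predict, then drop the result
        count += 1
    return count
```

In this sketch, handle would be where the image is opened with Image.open, passed through preprocess, and fed to the model; as long as it does not append results to a growing list, each tensor can be freed before the next image is loaded.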

fnak · January 3, 2022, 6:13pm

It seems the tensor operation is what is causing this issue.
I dug a bit deeper into the transform function. The issue is caused by the following line:
tensor.sub_(mean).div_(std).

I tried to imitate it manually by doing the following:

Outside the loop:

MEAN = 255 * torch.tensor([0.485, 0.456, 0.406])
STD = 255 * torch.tensor([0.229, 0.224, 0.225])
meanOP = MEAN[:, None, None]
stdOP = STD[:, None, None]

In the loop:
img_pil = (img_pil - meanOP) / stdOP

The issue is reproduced with the above.
So it seems it is related to tensor operations.
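
For comparison, here is a small sketch of the two normalization styles mentioned in this post, on a synthetic tensor rather than a real image. It assumes values already scaled to [0, 1] (as ToTensor produces), so the statistics are used without the 255 factor; the function names are hypothetical. The out-of-place version allocates new tensors, while the in-place version reuses the input's storage, which can be checked with data_ptr(). Neither form is expected to leak on its own, since temporaries are freed once nothing references them.

```python
import torch

# Channel statistics from the thread, shaped (3, 1, 1) to broadcast over (C, H, W).
MEAN = torch.tensor([0.485, 0.456, 0.406])[:, None, None]
STD = torch.tensor([0.229, 0.224, 0.225])[:, None, None]


def normalize_out_of_place(img: torch.Tensor) -> torch.Tensor:
    # Each operator allocates a new tensor: one for the subtraction, one for the division.
    return (img - MEAN) / STD


def normalize_in_place(img: torch.Tensor) -> torch.Tensor:
    # sub_ and div_ overwrite img's existing storage and return the same tensor.
    return img.sub_(MEAN).div_(STD)


img = torch.rand(3, 224, 224)  # synthetic stand-in for a ToTensor() result
out = normalize_out_of_place(img)
same = normalize_in_place(img.clone())

print("new storage allocated:", out.data_ptr() != img.data_ptr())
print("results match:", torch.allclose(out, same))
```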

fnak · January 3, 2022, 7:16pm

UPDATE:

It seems that the issue is worse than I thought. It could be related to any tensor operation. Even a simple operation such as changing the type of the tensor to float32 causes this memory problem as well.

For some reason, the memory is not being cleaned.
PS: I tried forcing garbage collection. It was not useful.

          with Image.open(image_path) as img_pil:
            img_pil = torch.from_numpy(np.array(img_pil))
            img_pil = img_pil.type(torch.float32)
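
One way to check whether this conversion alone retains memory is to log the resident set size (RSS) of the current process, which is less noisy than system-wide figures. The sketch below is an assumption-laden reproduction attempt, not the original code: it uses a synthetic NumPy array in place of np.array(img_pil), and the helper names rss_mib, convert, and measure are hypothetical.

```python
import gc

import numpy as np
import psutil
import torch


def rss_mib() -> float:
    """Resident set size of this process in MiB."""
    return psutil.Process().memory_info().rss / 2**20


def convert(array: np.ndarray) -> torch.Tensor:
    """Same two steps as in the post: wrap the array, then cast to float32."""
    tensor = torch.from_numpy(array)    # shares memory with the array
    return tensor.type(torch.float32)   # allocates a new float32 tensor


def measure(n_iters: int = 50, shape=(512, 512, 3)) -> list:
    """Run the conversion repeatedly and record RSS after each iteration."""
    readings = []
    for _ in range(n_iters):
        array = np.ones(shape, dtype=np.uint8)  # synthetic stand-in for an image
        tensor = convert(array)
        del array, tensor
        gc.collect()
        readings.append(rss_mib())
    return readings


if __name__ == "__main__":
    for value in measure():
        print(f"{value:.1f} MiB")
```

If RSS levels off after the first few iterations, the conversion itself is probably not the source of the growth, and something else in the surrounding code is likely holding references.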

my3bikaht · January 3, 2022, 9:42pm

Here’s a script I tried on my local machine:

tf = transforms.Compose(
    [transforms.Resize((1000, 1000)),
     transforms.PILToTensor()])

for fname in fnames:  # fnames: list of image file paths
    with Image.open(fname) as img_pil:
        img_pil = tf(img_pil)
        img_pil = img_pil.type(torch.float32)
    print(psutil.virtual_memory().available)

And here’s the output (swap is disabled):

54919290880
54912282624
54912274432
54900854784
54903816192
54903750656
54904684544
54904807424
54904815616
54904860672
54904860672
54904729600
54904963072
54904954880
54904762368
54904885248
54904717312
54904791040
54904856576
54904930304
54904983552
54904717312
54905122816
54904283136
54904274944
54904217600
54904332288

As you can see, memory is released just fine.
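
To make that reading of the numbers concrete, the sketch below summarizes the readings above. Since these are available-memory values, a leak would show up as a steady decrease. The warmup of 5 iterations is an arbitrary assumption, and the function names are hypothetical.

```python
# psutil.virtual_memory().available readings (bytes), copied from the output above.
READINGS = [
    54919290880, 54912282624, 54912274432, 54900854784,
    54903816192, 54903750656, 54904684544, 54904807424,
    54904815616, 54904860672, 54904860672, 54904729600,
    54904963072, 54904954880, 54904762368, 54904885248,
    54904717312, 54904791040, 54904856576, 54904930304,
    54904983552, 54904717312, 54905122816, 54904283136,
    54904274944, 54904217600, 54904332288,
]

MIB = 2**20


def total_drop_mib(readings):
    """Available memory lost between the first and last reading, in MiB."""
    return (readings[0] - readings[-1]) / MIB


def steady_state_spread_mib(readings, warmup=5):
    """Max minus min of the readings after skipping the first `warmup` iterations, in MiB."""
    tail = readings[warmup:]
    return (max(tail) - min(tail)) / MIB


print(f"total drop: {total_drop_mib(READINGS):.1f} MiB")
print(f"steady-state spread: {steady_state_spread_mib(READINGS):.1f} MiB")
```

The initial drop of roughly 14 MiB happens within the first few iterations; after that the readings stay within a band of roughly 1.3 MiB with no downward trend, which is consistent with memory being reused rather than leaked.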

ptrblck · January 3, 2022, 10:39pm

Could you post an executable code snippet to reproduce the increasing memory usage?
I’ve seen similar results to @my3bikaht’s post and couldn’t reproduce it.

fnak · January 4, 2022, 5:52pm

I checked your suggestions, and it turns out I get the same result. After isolating the problem, it seems that the issue is caused by the profiler I use to measure the performance of the model over the whole test set, as shown in the code below:

import torch
import torchvision.transforms as transforms
import os
from PIL import Image
import psutil
from torch.profiler import profile, record_function, ProfilerActivity

mean = (0.485, 0.456, 0.406)
std = (0.229, 0.224, 0.225)

img_folder = "/path/to/img"
img_size = 224

def test():
  preprocess = transforms.Compose([
      transforms.Resize((img_size,img_size)),
      transforms.ToTensor(),
      transforms.Normalize(mean=mean, std=std)
  ])

  for dir1 in os.listdir(img_folder):
    for file in os.listdir(os.path.join(img_folder, dir1)):
      image_path = os.path.join(img_folder, dir1,  file)
      with Image.open(image_path) as img_pil:
        img_pil = preprocess(img_pil)
      memory = psutil.virtual_memory()
      totmemory = memory.total >> 20
      usedmemory = memory.used >> 20
      print(usedmemory)

with profile(activities=[ProfilerActivity.CPU], profile_memory=True, record_shapes=True) as prof:
  test()
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))

Is there a better way to profile the model without RAM usage building up?
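
One option that may help is to profile only a short window of iterations instead of the whole test set, using torch.profiler.schedule together with prof.step(). The sketch below assumes the growth comes from the profiler accumulating events for every operation while it is recording; the synthetic tensor work stands in for preprocess(img_pil), and the wait/warmup/active values are arbitrary choices.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

tables = []  # filled by the callback when the active window finishes


def on_trace_ready(prof):
    # Summarize the recorded window; events outside the window are not collected.
    tables.append(
        prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10)
    )


def run(n_steps: int = 10) -> None:
    # Skip 1 step, warm up for 1 step, record 3 steps, then stop recording.
    sched = schedule(wait=1, warmup=1, active=3, repeat=1)
    with profile(
        activities=[ProfilerActivity.CPU],
        schedule=sched,
        on_trace_ready=on_trace_ready,
        profile_memory=True,
        record_shapes=True,
    ) as prof:
        for _ in range(n_steps):
            x = torch.rand(3, 224, 224)  # synthetic stand-in for preprocess(img_pil)
            x = (x - 0.5) / 0.25
            prof.step()  # tell the profiler that one iteration has finished


run()
print(tables[0])
```

With repeat=1 the profiler records a single window and stays idle for the remaining iterations, so the amount of profiling data no longer scales with the size of the dataset. The trade-off is that the statistics describe a sample of the run rather than all of it.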
I have created a colab here:

love_ptrblck · August 4, 2023, 1:39am

I am experiencing the same issue inside a torch Dataset, but without the profiler.

# works fine
images = images.numpy()
images = images / 255.0
images = torch.from_numpy(images)

# memory accumulates and explodes
images = torch.from_numpy(images)
images = images / 255.0 # or any kind of tensor operation

system information

PyTorch version: 2.0.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.31

Python version: 3.9.5 (default, Nov 23 2021, 15:27:38)  [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.59.1.el7.x86_64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: A100 Graphics Device
GPU 1: A100 Graphics Device
GPU 2: A100 Graphics Device
GPU 3: A100 Graphics Device
GPU 4: A100 Graphics Device
GPU 5: A100 Graphics Device
GPU 6: A100 Graphics Device
GPU 7: A100 Graphics Device

Nvidia driver version: 450.80.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          64
On-line CPU(s) list:             0-63
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7302 16-Core Processor
Stepping:                        0
CPU MHz:                         2994.171
BogoMIPS:                        5988.34
Virtualization:                  AMD-V
L1d cache:                       1 MiB
L1i cache:                       1 MiB
L2 cache:                        16 MiB
L3 cache:                        256 MiB
NUMA node0 CPU(s):               0-15,32-47
NUMA node1 CPU(s):               16-31,48-63
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Vulnerable: Load fences, __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:        Vulnerable
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 cpb cat_l3 cdp_l3 hw_pstate sme retpoline_amd ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip overflow_recov succor smca

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.2
[pip3] torch==2.0.0+cu117
[pip3] torch-tb-profiler==0.4.1
[pip3] torchaudio==2.0.1+cu117
[pip3] torchinfo==1.8.0
[pip3] torchvision==0.15.1+cu117
[conda] Could not collect

love_ptrblck · August 4, 2023, 2:34am

My case seems to be related to h5py objects being converted to torch.Tensor.

I did not find any solution or workaround.

github.com/facebookresearch/fastMRI

Memory leak with `h5py` from `pip` and conversion to `torch.Tensor`

opened 03:57PM - 15 Feb 22 UTC

Breeze-Zero

bug

I recently tried to do some experiments on my model with multi-coil FastMRI brain data. Due to the need for flexibility (and also because I don't have the extra time to learn how to use PyTorch Lightning), I didn't use PyTorch Lightning directly. Instead, I chose plain PyTorch, but during iteration, with only num_worker=2 set, my memory footprint was quite large from the beginning. As the number of iterations increased, an error occurred: RuntimeError: DataLoader worker (PID 522908) is killed by signal: killed. I checked the other parts of the training code, but found no obvious memory accumulation. Therefore, I thought the problem was most likely in siliceDataset. I simply used "pass" as the loop body while traversing the DataLoader, and found that the memory usage kept rising.