Creating tensors on CPU and measuring the memory consumption?

Let’s say that I have a PyTorch tensor that I’m loading onto the CPU. I would now like to experiment with different shapes and how they affect the memory consumption, and I thought the best way to do this is to create a simple random tensor and then measure the memory consumption of different shapes. However, while attempting this, I noticed some anomalies, so I decided to simplify the task further.

I’m creating a 3 GB PyTorch tensor and want to measure its memory consumption with the psutil module. In order to get some statistics, I do this ten times in a for loop and report the mean and standard deviation. I also move the tensor to the GPU and then use PyTorch functions to measure the allocated and the total reserved memory on the GPU.

My code is this:

import torch
import psutil

if __name__ == '__main__':
    resident_memories = []
    for i in range(10):
        # 3 * 1024**3 uint8 elements = 3 GiB
        x = torch.ones((3, 1024, 1024, 1024), dtype=torch.uint8)
        # Resident set size (non-swapped physical memory) of this process
        resident_memory = psutil.Process().memory_info().rss/1024**2
        resident_memories.append(resident_memory)
        del x, resident_memory
    print('Average resident memory [MB]: {} +/- {}'.format(torch.mean(torch.tensor(resident_memories)), torch.std(torch.tensor(resident_memories))))
    del resident_memories

    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    alloc_memories = []
    reserved_memories = []
    for i in range(10):
        x = torch.ones((3, 1024, 1024, 1024), dtype=torch.uint8).to(device)
        # Memory occupied by tensors vs. memory reserved by the caching allocator
        alloc_memory = torch.cuda.memory_allocated(device=device)/1024**2
        reserved_memory = torch.cuda.memory_reserved(device=device)/1024**2
        alloc_memories.append(alloc_memory)
        reserved_memories.append(reserved_memory)
        del x, alloc_memory
    print('By tensors occupied memory on GPU [MB]: {} +/- {}\nCurrent GPU memory managed by caching allocator [MB]: {} +/- {}'.format(
          torch.mean(torch.tensor(alloc_memories)), torch.std(torch.tensor(alloc_memories)), torch.mean(torch.tensor(reserved_memories)),
          torch.std(torch.tensor(reserved_memories)))
    )

I obtain the following output:

Average resident memory [MB]: 4028.602783203125 +/- 0.06685283780097961
By tensors occupied memory on GPU [MB]: 3072.0 +/- 0.0
Current GPU memory managed by caching allocator [MB]: 3072.0 +/- 0.0

I’m executing this code on a cluster, but I also ran the first part in the cloud and mostly observed the same behavior. When I ran it on the cluster, it was the only job on the CPU, so other jobs should (hopefully) not affect the memory measurements.

I have two quick questions:

i) Why is psutil.Process().memory_info().rss inaccurate when measuring the memory of a 3 GB tensor?

ii) How can we (correctly) measure the memory of tensors on the CPU? One use case might be that the tensor is so huge that moving it onto a GPU would cause an out-of-memory error.
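
For reference, the only number I can compute directly myself is the theoretical payload from the tensor metadata, which of course ignores any allocator or process overhead (a minimal sketch):

import torch

x = torch.ones((3, 1024, 1024, 1024), dtype=torch.uint8)
# Number of elements times bytes per element: 3 * 1024**3 * 1 byte = 3072 MiB
payload_mb = x.nelement() * x.element_size() / 1024**2
print('Theoretical tensor payload [MB]: {}'.format(payload_mb))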

@ptrblck Do you think that you could help, please?

I’m not familiar enough with Python’s memory management and garbage collection mechanism to properly explain the effect you are seeing. Since rss returns the “non-swapped physical memory” of the process, you would not only see the memory required by the tensor allocation, but also the memory used by the process itself, the loaded libraries, and all other objects.
My guess would be that the garbage collector kicks in at different intervals and might free some memory, so you could check whether playing around with its thresholds changes the behavior.
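
Just as a sketch of what I mean, you could inspect and change the thresholds (or trigger a collection manually) like this:

import gc

# Current generation thresholds; the defaults are usually (700, 10, 10)
print(gc.get_threshold())

# Example only: raise the generation-0 threshold so collections run less frequently
gc.set_threshold(10000, 10, 10)

# Or force a full collection right before reading the memory stats
gc.collect()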

Thanks for your reply! I want to be a bit cautious, since I’m inexperienced when it comes to Python/PyTorch memory management, but I don’t think the garbage collection mechanism is responsible for the effects we’ve seen so far. I think I found the underlying issue. Let me explain:

  • When you wrote that I should play around with the thresholds of the garbage collector, I thought it might be best to turn it off completely. So, if the optional garbage collector is enabled, I disable it at the beginning of my script:
import gc

if __name__ == '__main__': 
    if gc.isenabled(): 
        gc.disable()

However, after some tests, I saw no huge difference from the initially reported results.

  • What I found is that the statement import torch is itself quite expensive in terms of memory. To check this, I used the following code:
import gc
import torch 
import psutil

if __name__ == '__main__':
    print("Is Python's optional garbage collector enabled? Answer: {}".format(gc.isenabled()))
    if gc.isenabled():
        gc.disable()
        print("Python's optional garbage collector was disabled.")

    # Resident set size right after the imports, before any tensor is created
    resident_memory = psutil.Process().memory_info().rss/1024**2
    print('\nResident memory [MB]: {}'.format(resident_memory))

If I execute this code on an RTX2080Ti with CUDA 10.1, I consistently get the following output for the resident memory (this time I’m not reporting any error bars, but I made sure that the results are not orders of magnitude apart between runs):

Resident memory [MB]: 190.37109375

However, the same code executed on an RTX2080Ti with CUDA 11.0 yields:

Resident memory [MB]: 981.79296875

For both CUDA versions I used Python 3.6.9; with CUDA 10.1 I used PyTorch 1.7.1+cu101, whereas with CUDA 11.0 I had PyTorch 1.7.1 available.

My question would be: Why is there such a huge discrepancy in the resident memory between different CUDA versions? (The results I reported in my initial post were with CUDA 11.3, if I’m not mistaken.)
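
In the meantime, as a workaround for question ii), I’m thinking about measuring the rss difference around the allocation instead of the absolute value, so that the import and interpreter overhead cancels out (just a sketch; it assumes nothing else allocates memory in between the two readings):

import torch
import psutil

if __name__ == '__main__':
    proc = psutil.Process()
    # Baseline after the imports, so the fixed overhead cancels out in the difference
    rss_before = proc.memory_info().rss
    x = torch.ones((3, 1024, 1024, 1024), dtype=torch.uint8)
    rss_after = proc.memory_info().rss
    print('RSS difference [MB]: {}'.format((rss_after - rss_before)/1024**2))
    del x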

I was thinking about the gc because of the change in the reported memory while using the same code snippet, but again I don’t know what additional bookkeeping is done. You might want to take a look at this issue, which discusses the memory overhead and increase.
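
As a quick sanity check (just a sketch), you could also isolate the increase caused by the import itself by taking the rss before and after importing torch:

import psutil

proc = psutil.Process()
rss_before = proc.memory_info().rss / 1024**2

import torch  # importing torch loads its shared libraries, which shows up in rss

rss_after = proc.memory_info().rss / 1024**2
print('RSS increase from importing torch [MB]: {}'.format(rss_after - rss_before))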