Tracemalloc giving wrong numbers for memory?

Let’s take a look at this code:

import tracemalloc
import numpy as np

tracemalloc.start()
a = np.ones((3, 1024, 1024, 1024), dtype=np.uint8)
current, peak = tracemalloc.get_traced_memory()
print(f'Current: {current / 1024**2:.2f} [MB], peak: {peak / 1024**2:.2f} [MB]')

The output I get is

Current: 3072.00 [MB], peak: 3072.00 [MB]

Obviously, we expect an array of about 3 GB - it has 3 · 1024³ one-byte elements, i.e. exactly 3072 MB - so this looks good. However, if we do the same thing in PyTorch, i.e.

import torch
import tracemalloc

tracemalloc.start()
x = torch.ones((3, 1024, 1024, 1024), dtype=torch.uint8)
current, peak = tracemalloc.get_traced_memory()
print(f'Current: {current / 1024**2:.2f} [MB], peak: {peak / 1024**2:.2f} [MB]')

I get the following output:

Current: 0.00 [MB], peak: 0.00 [MB]

Something seems to be off. I tested PyTorch version 1.9.0 and Python version 3.8.10.


My question is: is this expected behavior? It certainly surprised me.

I don’t know anything about how torch.ones is implemented (source code). My guess would be that it works like lazy evaluation or a generator: PyTorch does not allocate memory and fill it with 1’s until the tensor is actually needed for an operation on the hardware. That way, PyTorch doesn’t waste precious resources (there tends to be a lot of data to process in ML/data science) on data that isn’t needed immediately.
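
One way to probe this from outside Python’s allocator would be to watch the process’s resident set size while the tensor is created. A minimal sketch (assuming the psutil package is available, which is not part of the setup above): if the allocation really were deferred, the RSS delta should stay near zero.

import os

import psutil
import torch

proc = psutil.Process(os.getpid())
rss_before = proc.memory_info().rss  # resident set size in bytes

x = torch.ones((3, 1024, 1024, 1024), dtype=torch.uint8)

rss_after = proc.memory_info().rss
# If allocation were truly lazy, this delta should stay near zero;
# a jump of ~3 GB would mean the memory is allocated and filled eagerly.
print(f'RSS delta: {(rss_after - rss_before) / 1024**3:.2f} GB')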

@tiramisuNcustard I tried the same thing with torch.rand() instead of torch.ones() - though I had to use dtype=torch.float32 in that case - and observed basically the same behavior. Do you think PyTorch still doesn’t fill the tensor with random numbers before we do some operations on it?

Also, is there any way we can explicitly test your hypothesis? For example, by performing operations on the tensor that force PyTorch to write it to the hardware, but that are not themselves memory-intensive? One candidate is sketched below.
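
One cheap candidate: reading back a single element, which requires real storage behind the tensor but allocates almost nothing itself. A minimal sketch of that test (same setup as above, nothing new assumed):

import tracemalloc

import torch

tracemalloc.start()
x = torch.ones((3, 1024, 1024, 1024), dtype=torch.uint8)

# Reading one element is cheap, but it can only succeed if the
# tensor's storage actually exists.
value = x[0, 0, 0, 0].item()

current, peak = tracemalloc.get_traced_memory()
# If the counters are still ~0 after the read, lazy allocation alone
# cannot explain why tracemalloc reports nothing.
print(f'Value: {value}, current: {current / 1024**2:.2f} [MB], peak: {peak / 1024**2:.2f} [MB]')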

@Imahn, I would create two 1 GB random tensors. Check the memory allocation. Then multiply the two tensors and store the result in a variable. Then check the memory allocation again. If there is a difference between the two memory-allocation checks, then I would be satisfied with my answer. The reason I am suggesting 1 GB is that, again, I don’t know how PyTorch implements these functions (will they allocate memory eagerly if the size is not too large, say < 200 MB?).

@tiramisuNcustard Sorry for the late reply. I tested the following code (x and y should each occupy roughly 1 GB of memory):

import torch 
import tracemalloc

tracemalloc.start()
x = torch.rand(256, 1024, 1024)
y = torch.rand(256, 1024, 1024)
current, peak = tracemalloc.get_traced_memory()
out = x * y
current_mult, peak_mult = tracemalloc.get_traced_memory()
print(f'Current/peak before multiplication [GB]: {current / 1024**3:.2f}, {peak / 1024**3:.2f}'
      f'\nCurrent/peak after multiplication [GB]: {current_mult}, {peak_mult}'
      f'\nTensor dtype: {x.dtype}, output shape: {out.shape}')

The output is:

Current/peak before multiplication [GB]: 0.00, 0.00 
Current/peak after multiplication [GB]: 192, 192
Tensor dtype: torch.float32, output shape: torch.Size([256, 1024, 1024])

Tbh, I’m now totally puzzled about why the memory consumption is so huge after the multiplication (192 GB). I would have expected 2 GB for x and y, and maybe a few GB for the multiplication, but O(100) GB for a single multiplication sounds like a lot to me.

@Imahn, your first print statement is the following: Current/peak before multiplication [GB]: {current / 1024**3:.2f}. If we copy that formatting over to the print statement after the multiplication, i.e. Current/peak after multiplication [GB]: {current_mult / 1024**3:.2f}, what do we get?

Maybe the missing / 1024**3:.2f is causing the trouble?

@tiramisuNcustard Sorry, totally right. Corrected output:

Current/peak before multiplication [GB]: 0.00, 0.00 
Current/peak after multiplication [GB]: 0.00, 0.00

What does out.element_size() * out.nelement() give you? Can you use it before current_mult, peak_mult = tracemalloc.get_traced_memory() and see what it shows?

Sure, before

current_mult, peak_mult = tracemalloc.get_traced_memory()

I added

print(f'Expected memory consump [GB]: {out.nelement() * out.element_size() / 1024**3:.2f}')

(as you suggested), the output is

Expected memory consump [GB]: 1.00

I did the same for x and y before

current, peak = tracemalloc.get_traced_memory()

and I get 1 GB each for x and y as the expected memory consumption…
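
(For reference, those 1 GB figures follow directly from the shapes, since a float32 element takes 4 bytes:)

# Shape (256, 1024, 1024) at 4 bytes per float32 element:
n_bytes = 256 * 1024 * 1024 * 4
print(n_bytes / 1024**3)  # 1.0 -> 1 GB each for x, y and out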