I don’t know anything about how torch.ones is implemented (I haven’t looked at the source code). My guess would be that it works similarly to lazy evaluation/a generator: PyTorch will not allocate memory and fill it with 1’s until the tensor is actually needed for an operation on the hardware. This way, PyTorch doesn’t waste precious resources (there tends to be a lot of data to process in ML/data science) on data that is not needed immediately.
@tiramisuNcustard I tried the same thing with torch.rand() instead of torch.ones() - though I had to use dtype = torch.float32 in that case - and basically observed the same behavior. Do you think PyTorch will still not fill the tensor up with random numbers before we do some operations on it?
Also, is there any way we can explicitly test your hypothesis? For example, by doing operations on the tensor that force PyTorch to write it to the hardware, but that are not memory-intensive?
@Imahn, I would create two 1 GB random tensors and check the memory allocation. Then multiply the two tensors, store the result in a variable, and check the memory allocation again. If there is a difference between the two memory allocation checks, I would be satisfied with the answer. The reason I am suggesting 1 GB is that, again, I don’t know how PyTorch implements these functions (would they allocate memory at all if the size is not too large, say < 200 MB?).
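A minimal sketch of that experiment could look like the one below; I’m measuring the process’s resident set size with psutil (an extra dependency I’m assuming here, not something PyTorch provides), since that should also catch memory PyTorch allocates outside of Python’s own allocator:
import os
import psutil  # assumed to be installed; any process-level memory probe would work
import torch

def rss_gb():
    # Resident set size of the current process, in GB
    return psutil.Process(os.getpid()).memory_info().rss / 1024**3

print(f'RSS at start: {rss_gb():.2f} GB')

# Two float32 tensors with 256 * 1024 * 1024 elements each, i.e. roughly 1 GB apiece
x = torch.rand(256, 1024, 1024)
y = torch.rand(256, 1024, 1024)
print(f'RSS after creating x and y: {rss_gb():.2f} GB')

out = x * y  # the result should need roughly another 1 GB
print(f'RSS after the multiplication: {rss_gb():.2f} GB')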
@tiramisuNcustard Sorry for the late reply. I tested the following code (x and y should each occupy roughly 1 GB of memory):
import torch
import tracemalloc

tracemalloc.start()

# Each tensor has 256 * 1024 * 1024 float32 elements, i.e. roughly 1 GB
x = torch.rand(256, 1024, 1024)
y = torch.rand(256, 1024, 1024)
current, peak = tracemalloc.get_traced_memory()

out = x * y  # elementwise multiplication
current_mult, peak_mult = tracemalloc.get_traced_memory()

print(f'Current/peak before multiplication [GB]: {current / 1024**3:.2f}, '
      f'peak before multiplication: {peak / 1024**3:.2f} [GB]'
      f'\nCurrent/peak after multiplication [GB]: {current_mult}, {peak_mult}'
      f'\nTensor dtype: {x.dtype}, output shape: {out.shape}')
The output is:
Current/peak before multiplication [GB]: 0.00, 0.00
Current/peak after multiplication [GB]: 192.00, 192.00
Tensor dtype: torch.float32, output shape: torch.Size([256, 1024, 1024])
Tbh, I’m now totally puzzled as to why the memory consumption is so huge after the multiplication (192 GB). I would have expected 2 GB for x and y, and maybe a few GB for the multiplication, but O(100) GB for a multiplication sounds like a lot to me?
@Imahn, your first print statement is the following: Current/peak before multiplication [GB]: {current / 1024**3:.2f}. If we copy that formatting over to the print statement after the multiplication, i.e. Current/peak after multiplication [GB]: {current_mult / 1024**3:.2f}, what do we get?
Maybe the missing / 1024**3:.2f is causing the trouble?
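If it helps, this is what that corrected line would look like, reusing current_mult and peak_mult from your snippet above (tracemalloc reports its numbers in bytes, so without the conversion you are looking at a raw byte count, not GB):
# Same formatting as the first measurement: convert bytes to GB
print(f'Current/peak after multiplication [GB]: '
      f'{current_mult / 1024**3:.2f}, {peak_mult / 1024**3:.2f}')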
What does out.element_size() * out.nelement() give you? Can you use it before current_mult, peak_mult = tracemalloc.get_traced_memory() and see what it shows?
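As a sanity check, this is how I would read off the size of out directly (again reusing the variables from the snippet above); for a float32 tensor of shape (256, 1024, 1024) the arithmetic is fixed, whatever tracemalloc reports:
# Bytes held by the result tensor's storage:
# 256 * 1024 * 1024 elements * 4 bytes per float32 element
size_bytes = out.element_size() * out.nelement()
print(size_bytes)            # 1073741824
print(size_bytes / 1024**3)  # 1.0, i.e. ~1 GB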