Memory usage when calculating an array by chunks

I want to verify I understand how things work.

Consider the following two alternatives (some unimportant details are left out, but I could construct a more complete example if needed):

ys = list()
for x in xs:
    ys.append(f(x))
ys = torch.stack(ys)

vs

ys = torch.empty(len(xs), ...)
for i, x in enumerate(xs):
    ys[i] = f(x)

My question: is it true that, while both options are fairly similar, the peak memory usage of the 2nd is higher? I would think that if c is the temporary memory needed during the computation of f(x), the 2nd version needs c + (len(xs) + 1) * size(f(x)) instead of just c + len(xs) * size(f(x)): the whole output array exists in memory (hence len(xs) * size(f(x))), and each f(x) is computed into a new chunk of memory (hence c + size(f(x))) and only then copied into ys[i], rather than being written directly into the memory backing ys[i].
I’d be happy to confirm this, or to learn how to verify such things.

The second approach should reduce the peak memory, since you avoid the additional output tensor that torch.stack creates while the list of per-chunk results is still alive.
Here is an example showing the allocated and max. memory usage.
Note that I’ve disabled the cuBLAS workspace allocation to make the reports clearer, as it would otherwise show up in the numbers.

import torch
import torch.nn as nn

import os

# disallow cuBLAS to allocate a workspace to get a better understanding of the memory allocations
# alternatively you can also clear it via torch._C._cuda_clearCublasWorkspaces()
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":0:0"

###############################################################################
## setup
device = "cuda"

xs = torch.randn(1024, 16, 1024).to(device)
print(torch.cuda.memory_allocated()/1024**2)
# 64.0

f = nn.Linear(1024, 1024, bias=False).to(device)
print(torch.cuda.memory_allocated()/1024**2)
# 68.0

print(torch.cuda.max_memory_allocated()/1024**2)
# 68.0

###############################################################################
## 1st approach
ys = list()
for x in xs:
    ys.append(f(x))

print(torch.cuda.memory_allocated()/1024**2)
# 132.0 = 68.0 + 1024*16*1024*4 / 1024**2 = 68.0 + 64.0
print(torch.cuda.max_memory_allocated()/1024**2)
# 132.0

ys = torch.stack(ys)

print(torch.cuda.memory_allocated()/1024**2)
# 132.0
print(torch.cuda.max_memory_allocated()/1024**2)
# 196.0 = 132.0 + 64.0 (temp)

###############################################################################
## 2nd approach
# make sure we are at the same stage
print(torch.cuda.memory_allocated()/1024**2)
# 68.0

print(torch.cuda.max_memory_allocated()/1024**2)
# 68.0

ys = torch.empty_like(xs)
for i, x in enumerate(xs):
    ys[i] = f(x)

print(torch.cuda.memory_allocated()/1024**2)
# 132.0

print(torch.cuda.max_memory_allocated()/1024**2)
# 132.0625
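
The step between the two approaches is not shown above; presumably the result of the 1st approach was freed and the peak counter reset before the 2nd run, e.g. along these lines (a sketch; torch.cuda.reset_peak_memory_stats() resets the value returned by torch.cuda.max_memory_allocated()):

# free the stacked result of the 1st approach so that only xs and the
# linear layer stay allocated (back to 68.0 MB)
del ys
# reset the peak counter so that max_memory_allocated() reports only the
# peak of the 2nd approach
torch.cuda.reset_peak_memory_stats()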

First of all, thanks @ptrblck for the detailed answer and the elaborate code.
I still have a couple of questions.

  1. I played with your example and I think the correct answer can vary. While still working with a toy example, I’m trying to imitate my real usage - f requires a large amount of memory, but only temporarily (f contains attention blocks), which is the c from my example.
    I modified your code in a slightly silly way just to describe that situation - I replaced the linear block with:
    def f(x):
        # allocate (and immediately discard) a ~1 GB temporary tensor,
        # standing in for the temporary memory c used inside f
        torch.zeros(1024 ** 3 // 4).to(device)
        return x
    

Also, for nicer numbers and a shorter run time, I changed to xs = torch.randn(1024, 16, 1024).to(device).
Now my numbers are:
1 MB in the first block - we only allocate for xs
1st approach: 1.0, 1025.0, 2.0, 1025.0
2nd approach: 2.0, 1025.0, 2.0, 1026.0
So on a theoretical GPU with 1025 MB of memory, the first approach fits and the second doesn’t - right? Or did I cheat?

  2. You allocate on the CPU and then use .to(device) instead of allocating directly on the device - is that important here, or just quick prototyping?
  3. Could you please elaborate on the effect we cancel with os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":0:0"? What might happen if we don’t do that?

Much appreciated,
Yiftach

  1. Your code doesn’t correspond to a real use case, as you are creating intermediate tensors that are never used or assigned. So it seems as if you are now profiling the caching mechanism, i.e. how quickly tensors are freed (returned to the cache, etc.). But yes, generally, just profiling your real use case might still be a good idea.

  2. No, that’s not important here; you should allocate directly on the device for performance reasons.

  3. The reported memory won’t “perfectly” match what you’d expect from the tensors alone, since cuBLAS will also allocate its workspace, as seen here:

import torch
import torch.nn as nn

## setup
device = "cuda"

xs = torch.randn(1024, 1024).to(device)
lin = nn.Linear(1024, 1024, bias=False).to(device)

print(torch.cuda.memory_allocated()/1024**2)
# 8.0

out = lin(xs)
print(torch.cuda.memory_allocated()/1024**2)
# 20.125

torch._C._cuda_clearCublasWorkspaces()
print(torch.cuda.memory_allocated()/1024**2)
# 12.0

The workspace allocation is irrelevant for your use case, which is why I’ve disabled it to create memory allocation values which are easier to understand.
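
For comparison, here is a sketch of the same snippet with the workspace disabled up front (assumptions: the environment variable has to be set before the first cuBLAS call to take effect, and the input is created directly on the GPU as suggested in point 2; the 12.0 is the expected value, not a measured one):

import os
# assumed: needs to be set before the first cuBLAS call, otherwise an
# already-created workspace would stay allocated
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":0:0"

import torch
import torch.nn as nn

device = "cuda"
# create the tensors directly on the GPU (see point 2 above)
xs = torch.randn(1024, 1024, device=device)
lin = nn.Linear(1024, 1024, bias=False, device=device)

print(torch.cuda.memory_allocated()/1024**2)
# 8.0 (input + weight)

out = lin(xs)
print(torch.cuda.memory_allocated()/1024**2)
# 12.0 expected (input + weight + output), i.e. no cuBLAS workspace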

Fair enough, I may have over-simplified. Here is an example of a more realistic f, where the second approach requires more memory (at its peak):

import torch
import torch.nn as nn
import os

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":0:0"

###############################################################################
## setup
device = "cuda"

xs = torch.randn(16, 4 * 1024, 256, device=device)
print(torch.cuda.memory_allocated()/1024**2)
# 64.0

f = nn.MultiheadAttention(256, 1, device=device)
print(torch.cuda.memory_allocated()/1024**2)
# 65.00390625

print(torch.cuda.max_memory_allocated()/1024**2)
# 65.00390625

###############################################################################
## 1st approach
ys = list()
for x in xs:
    ys.append(f(x, x, x)[0])

print(torch.cuda.memory_allocated()/1024**2)
# 1409.00390625
print(torch.cuda.max_memory_allocated()/1024**2)
# 1477.00390625

ys = torch.stack(ys)

print(torch.cuda.memory_allocated()/1024**2)
# 1473.00390625
print(torch.cuda.max_memory_allocated()/1024**2)
# 1477.00390625

###############################################################################
## 2nd approach
# make sure we are at the same stage
print(torch.cuda.memory_allocated()/1024**2)
# 65.00390625

print(torch.cuda.max_memory_allocated()/1024**2)
# 65.00390625

ys = torch.empty_like(xs)
for i, x in enumerate(xs):
    ys[i] = f(x, x, x)[0]

print(torch.cuda.memory_allocated()/1024**2)
# 1409.00390625

print(torch.cuda.max_memory_allocated()/1024**2)
# 1481.00390625

To be honest, even in this MHA case, only certain choices of the sequence length and the hidden dimension make the 2nd approach worse, and the difference is small in any case. Still, I believe the answer may depend on f. Perhaps a general rule could be formulated for when each approach is favorable (or one may simply resort to profiling).
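
For such profiling, a rough helper along these lines could be used (a sketch; peak_mb, run_stack and run_preallocate are made-up names, and run_preallocate assumes f(x) has the same shape as x, as in the examples above):

import torch

def peak_mb(fn, *args):
    # run fn(*args) and return the peak CUDA memory (in MB) observed during
    # the call; the value also includes whatever was already allocated
    # before the call (e.g. xs and the parameters of f)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    fn(*args)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024**2

def run_stack(f, xs):
    # 1st approach: collect the chunks in a list, then stack them
    return torch.stack([f(x) for x in xs])

def run_preallocate(f, xs):
    # 2nd approach: write each chunk into a preallocated output tensor
    ys = torch.empty_like(xs)
    for i, x in enumerate(xs):
        ys[i] = f(x)
    return ys

# usage with the MHA example above:
# mha = lambda x: f(x, x, x)[0]
# print(peak_mb(run_stack, mha, xs))
# print(peak_mb(run_preallocate, mha, xs))

Comparing the two numbers for a given f then shows directly which variant peaks higher in that setting.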