CUDA out of memory. Tried to allocate 16.00 GiB (GPU 0; 10.76 GiB total capacity; 536.52 MiB already allocated; 8.83 GiB free; 736.00 MiB reserved in total by PyTorch)

Hi, I successfully ran the code in this project: GitHub - erichson/NFM.
For some reason, when I added this function, which doesn’t really do anything, I got a CUDA out of memory error.

I call this function in every mini-batch iteration during training, and after the first iteration I get the error (during the loss.backward() call).
In addition, if I remove lines 43-44 (in the picture), the code works fine.
What could be the cause of this?

I’m not sure how to understand this description, as your method is clearly calling into torch.linalg.svd and multiple torch.mm ops.

I would assume that lines 43-44 try to allocate the 16GB of memory. Did you check the memory requirement for these calls (e.g. you could use the input shapes to calculate the expected output shape of the matmul and, from that, the needed memory)?
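As a rough illustration of what I mean (a toy sketch with made-up shapes, not the ones from your code): a dense result tensor needs roughly its number of elements times the element size, so you can estimate the allocation directly from the shapes.

import torch

# hypothetical shapes, just to show the estimate
a = torch.randn(2, 4096)
b = torch.randn(4096, 8192)

out_shape = (a.shape[0], b.shape[1])       # shape of a @ b -> (2, 8192)
numel = out_shape[0] * out_shape[1]
print(numel * a.element_size() / 1024**2)  # float32 -> 4 bytes per element
# 0.0625 (MiB; the caching allocator may round this up a bit)

If such an estimate for the tensors in lines 43-44 is nowhere near 16GB, the large allocation is likely coming from somewhere else.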

Thanks for the fast reply.

the input shape is [2,65536] (2 is the batch size)
the shape of the ‘u’ variable is [2,2]
the shape of the ‘s’ variable is [2,2]
the shape of the ‘v’ variable is [2,65536]
It looks like the multiplication of ‘u’, ‘s’ and ‘v’ is trying to allocate the 16GB, but it’s a bit weird because the shapes of the variables are not that big.
In addition, for any batch size (not only 2), it tries to allocate 16GB.

I think the shape of s would be [2], but nevertheless I cannot reproduce the issue with this input shape and see a total allocation of only ~2MB:

import torch

device = 'cuda'
print(torch.cuda.memory_allocated() / 1024**3)
# 0.0

z = torch.randn(2, 65536, device=device, requires_grad=True)
print(torch.cuda.memory_allocated() / 1024**3)
# 0.00048828125

u, s, v = torch.linalg.svd(z, full_matrices=False)
print(u.shape)
# torch.Size([2, 2])
print(s.shape)
# torch.Size([2])
print(v.shape)
# torch.Size([2, 65536])
print(torch.cuda.memory_allocated() / 1024**3)
# 0.0009775161743164062

tmp = torch.mm(torch.diag(s), v)
print(tmp.shape)
# torch.Size([2, 65536])
print(torch.cuda.memory_allocated() / 1024**3)
# 0.0014657974243164062

out = torch.mm(u, tmp)
print(out.shape)
# torch.Size([2, 65536])
print(torch.cuda.memory_allocated() / 1024**3)
# 0.0019540786743164062

I printed the CUDA memory allocated as you did, and I got similar results.
My problem is raised when the loss.backward() function is called.
For some reason, the backprop is probably trying to allocate 16GB, no matter what the batch size is.

Maybe, but then the allocation happens in another part of the code, not in the posted one.
If you set the requires_grad attribute of the input to True when creating z, you can reduce the output and call .backward() on it, which yields:

out.mean().backward()
print(torch.cuda.memory_allocated() / 1024**3)
# 0.0024423599243164062
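
To narrow down where the 16GB request actually comes from in your training script, you could also track the peak allocation around the forward and backward pass of a real iteration. A rough sketch (model, criterion, x, and y are placeholders for whatever your script uses):

import torch

def debug_step(model, criterion, x, y):
    # reset the peak statistics so the numbers refer to this step only
    torch.cuda.reset_peak_memory_stats()

    out = model(x)
    loss = criterion(out, y)
    print('peak after forward :', torch.cuda.max_memory_allocated() / 1024**3, 'GiB')

    loss.backward()
    print('peak after backward:', torch.cuda.max_memory_allocated() / 1024**3, 'GiB')
    return loss

If the peak only jumps by ~16GB during the backward call, printing torch.cuda.memory_summary() right before it might help to see which allocations are already held at that point.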