CUDA out of memory. Tried to allocate 16.00 GiB (GPU 0; 10.76 GiB total capacity; 536.52 MiB already allocated; 8.83 GiB free; 736.00 MiB reserved in total by PyTorch)

Hi, I successfully ran the code in this project: GitHub - erichson/NFM.
For some reason, when I added this function, which doesn’t really do anything, I got a CUDA out of memory error.

I was calling this function in every mini-batch iteration during training, and after the first iteration I got the error (in the loss.backward() call).
In addition, if I remove lines 43-44 (in the picture), the code works well.
What could be the cause of this?

I’m not sure how to understand this description, as your method is clearly calling into torch.linalg.svd and multiple ops.

I would assume that lines 43-44 try to allocate the 16GB of memory. Did you check the memory requirement for these calls (e.g. you could use the input shapes to calculate the expected output shape for the matmul etc.)?

Thanks for the fast reply.

The input shape is [2,65536] (2 is the batch size).
The shape of the ‘u’ variable is [2,2].
The shape of the ‘s’ variable is [2,2].
The shape of the ‘v’ variable is [2,65536].
It looks like the multiplication of ‘u’, ‘s’ and ‘v’ is trying to allocate the 16GB, but that’s a bit weird because the shapes of the variables are not that big.
In addition, it tries to allocate 16GB for any batch size (not only 2).
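
One way to check the sizes listed above is to compute a tensor's footprint directly from its shape. The sketch below is plain Python; the helper name tensor_gib and the 4-byte float32 element size are assumptions made for illustration. The square [65536, 65536] shape is hypothetical and is not reported anywhere in this thread; it is included only because a float32 tensor of that shape would occupy exactly 16 GiB regardless of the batch size.

```python
import math

def tensor_gib(shape, itemsize=4):
    """Return the size in GiB of a dense tensor with the given shape.

    itemsize is the number of bytes per element (4 assumes float32).
    """
    return math.prod(shape) * itemsize / 1024**3

# Shapes reported in the thread (batch size 2)
for name, shape in [("input", (2, 65536)), ("u", (2, 2)), ("v", (2, 65536))]:
    print(name, shape, tensor_gib(shape))

# Hypothetical square matrix over the feature dimension (not reported in the thread)
print("hypothetical", (65536, 65536), tensor_gib((65536, 65536)))
```

If some intermediate in the forward or backward pass had that square shape, it would match the 16.00 GiB request in the error message. Whether that actually happens here is not established by the thread.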

I think the shape of s would be [2], but nevertheless I cannot reproduce the issue using the input shape and get a memory allocation of ~2MB:

import torch

device = 'cuda'
print(torch.cuda.memory_allocated() / 1024**3)
# 0.0

z = torch.randn(2, 65536, device=device)
print(torch.cuda.memory_allocated() / 1024**3)
# 0.00048828125

u, s, v = torch.linalg.svd(z, full_matrices=False)
# torch.Size([2, 2])
# torch.Size([2])
# torch.Size([2, 65536])
print(torch.cuda.memory_allocated() / 1024**3)
# 0.0009775161743164062

tmp = torch.matmul(torch.diag(s), v)  # call assumed: scales the rows of v by the singular values
# torch.Size([2, 65536])
print(torch.cuda.memory_allocated() / 1024**3)
# 0.0014657974243164062

out = torch.matmul(u, tmp)  # call assumed: recombines u with the scaled v
# torch.Size([2, 65536])
print(torch.cuda.memory_allocated() / 1024**3)
# 0.0019540786743164062

I printed the CUDA memory allocated as you did and got similar results.
My problem occurs when loss.backward() is called.
For some reason, the backward pass is probably trying to allocate 16GB, regardless of the batch size.

Maybe in another part of the code, but not in the posted one.
You can set the requires_grad attribute of the input to True, reduce the output, and call .backward() on it which yields:

print(torch.cuda.memory_allocated() / 1024**3)
# 0.0024423599243164062
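
The check described above can be sketched as follows. This is a minimal sketch under assumptions: the recombination via torch.diag and torch.matmul, the sum() reduction, and the CPU fallback are illustrative choices and not the original poster's code.

```python
import torch

# Fall back to CPU when no GPU is present (assumption, so the sketch runs anywhere)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Input with the shape from the thread, tracked by autograd
z = torch.randn(2, 65536, device=device, requires_grad=True)

# Reduced SVD: u is [2, 2], s is [2], v is [2, 65536]
u, s, v = torch.linalg.svd(z, full_matrices=False)

# Recombine the factors (assumed reconstruction), then reduce to a scalar
out = torch.matmul(u, torch.matmul(torch.diag(s), v))
loss = out.sum()
loss.backward()

print(z.grad.shape)
if device == 'cuda':
    # Memory currently held by tensors, in GiB
    print(torch.cuda.memory_allocated() / 1024**3)
    # Peak memory held by tensors since the program started, in GiB
    print(torch.cuda.max_memory_allocated() / 1024**3)
```

Printing torch.cuda.max_memory_allocated() alongside torch.cuda.memory_allocated() can help here, because temporaries created during backward() are usually freed before the current allocation is read.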