torch.matmul(torch.randn(16, 4, 7056, 10), torch.randn(16, 4, 10, 7056))
RuntimeError: CUDA out of memory. Tried to allocate 11.87 GiB (GPU 0; 10.91 GiB total capacity; 99.79 MiB already allocated; 9.00 GiB free; 4.21 MiB cached)
The output size is
torch.Size([16, 4, 7056, 7056])
which is 16 × 4 × 7056 × 7056 ≈ 3.19 billion float32 elements, i.e. about 11.87 GiB — the allocator's request is correct. If you want the whole output on the GPU at once, you need a card with more than 12 GB of memory.
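You can check that number yourself with a little arithmetic (assuming the default float32 dtype, 4 bytes per element):

```python
# Bytes needed for a (16, 4, 7056, 7056) float32 tensor.
elements = 16 * 4 * 7056 * 7056
bytes_needed = elements * 4          # float32 = 4 bytes per element
gib = bytes_needed / 2**30           # GiB, the unit CUDA reports
print(f"{gib:.2f} GiB")              # → 11.87 GiB, matching the error message
```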
You don’t give much context — maybe this is obvious, but in case not:
You can keep the matrices on the CPU, iterate over the first two (batch) dimensions, and send the sliced tensors to the GPU for the actual
matmul (which only operates on the last two dimensions).
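A minimal sketch of that loop (the helper name `chunked_matmul` is my own; shapes are shrunk so it runs on any machine — with your sizes, only one 7056 × 7056 slice would live on the GPU at a time, ~190 MiB instead of ~12 GiB):

```python
import torch

def chunked_matmul(a, b, device="cuda"):
    # a: (B, H, M, K) on CPU, b: (B, H, K, N) on CPU.
    # Loop over the two batch dims, do each (M, K) @ (K, N) product
    # on `device`, and move the result back to the CPU immediately,
    # so at most one (M, N) slice occupies GPU memory at a time.
    B, H, M, K = a.shape
    N = b.shape[-1]
    out = torch.empty(B, H, M, N)
    for i in range(B):
        for j in range(H):
            out[i, j] = (a[i, j].to(device) @ b[i, j].to(device)).cpu()
    return out

# Smaller than the shapes in the question, just to demonstrate:
a = torch.randn(16, 4, 64, 10)
b = torch.randn(16, 4, 10, 64)
dev = "cuda" if torch.cuda.is_available() else "cpu"
res = chunked_matmul(a, b, device=dev)   # shape (16, 4, 64, 64)
```

If even one slice is too large, you can additionally chunk along the M dimension (rows of `a`) within each batch element.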