Why is my matrix multiplication requesting 90GB of GPU RAM?

I have three matrices I need to multiply: A (3000 x 100), B (100 x 100), and C (100 x 3,000,000). C is very large, but moving all three matrices to the GPU takes only slightly over 2GB. Here’s the code I’m using:

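import numpy as np
import torch

# Inputs are created on the CPU as float64 (NumPy's default).
# randn needs integer dimensions, so 3e6 (a float) is written as 3_000_000.
A = np.random.randn(3000, 100)
B = np.random.randn(100, 100)
C = np.random.randn(100, 3_000_000)

# Copy to the GPU; torch.from_numpy preserves the dtype.
A_gpu = torch.from_numpy(A).cuda()
B_gpu = torch.from_numpy(B).cuda()
C_gpu = torch.from_numpy(C).cuda()

# Chained product, evaluated left to right: (A @ B) @ C
R_gpu = A_gpu @ B_gpu @ C_gpu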

I don’t understand why so much GPU RAM is required. Isn’t the multiplication just some tiled CUDA kernel?

C is 100 x 3,000,000.
Your output matrix is 3000 x 3,000,000.

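I mean, each of those numbers takes 4 bytes as float32, so the result alone needs about 36 GB. np.random.randn actually gives you float64 and torch.from_numpy keeps that dtype, so here it is 8 bytes per number and about 72 GB.
If you also need to save intermediate results for backprop…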
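
To make the arithmetic concrete, here is a rough sketch of the sizes involved. `tensor_bytes` is a hypothetical helper written for this post, not a PyTorch function, and it only counts raw element storage; whatever extra the allocator or the matmul workspace needs comes on top of these numbers.

```python
def tensor_bytes(rows, cols, bytes_per_element):
    """Raw storage for a dense rows x cols matrix (no allocator overhead)."""
    return rows * cols * bytes_per_element

# Shapes from the question, plus the shape of the final product.
shapes = {
    "A": (3000, 100),
    "B": (100, 100),
    "C": (100, 3_000_000),
    "A @ B @ C": (3000, 3_000_000),
}

for name, (rows, cols) in shapes.items():
    as_f64 = tensor_bytes(rows, cols, 8) / 1e9  # float64: 8 bytes per element
    as_f32 = tensor_bytes(rows, cols, 4) / 1e9  # float32: 4 bytes per element
    print(f"{name:>10}: {as_f64:8.3f} GB as float64, {as_f32:8.3f} GB as float32")
```

The three inputs add up to a bit over 2.4 GB as float64, which matches what was measured, while the product has 9 billion elements and dwarfs them.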
Makes sense

Ah shoot, I measured each individual matrix but not the result. I suppose I could create 4 streams to run on each GPU and then iterate through blocks of the right matrix.
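
Iterating through column blocks of the right matrix could look roughly like the sketch below. It is single-device only and does not use streams or multiple GPUs; `blocked_product` and `block_cols` are names made up for this example, and the small shapes at the bottom are just for a quick check. The idea is that A @ B is tiny, so it is computed once, and only one block of C and one block of the result need to be on the device at a time.

```python
import torch

def blocked_product(A, B, C, block_cols, device):
    """Yield (start, stop, (A @ B) @ C[:, start:stop]) one column block at a time."""
    AB = A.to(device) @ B.to(device)  # small: rows(A) x cols(B)
    n_cols = C.shape[1]
    for start in range(0, n_cols, block_cols):
        stop = min(start + block_cols, n_cols)
        C_block = C[:, start:stop].to(device)  # only this slice goes to the device
        yield start, stop, AB @ C_block

# Small stand-in shapes so the sketch runs anywhere; falls back to CPU without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
A = torch.randn(30, 10, dtype=torch.float64)
B = torch.randn(10, 10, dtype=torch.float64)
C = torch.randn(10, 1000, dtype=torch.float64)

for start, stop, R_block in blocked_product(A, B, C, block_cols=256, device=device):
    # Consume each block here (reduce it, write it to disk, copy it to the host, ...).
    print(start, stop, tuple(R_block.shape))
```

With the real shapes and float64, a block of 100,000 columns gives a 3000 x 100,000 result block of about 2.4 GB. The full 3000 x 3,000,000 result still has to live somewhere if you need all of it, so this only helps if each block can be consumed or moved off the GPU as it is produced.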