I have 3 matrices I need to multiply: `A: 3000 x 100, B: 100 x 100, C: 100 x 3MM` (3MM = 3 million). C is very large, but moving all 3 of these matrices to the GPU takes only slightly over 2GB. Here’s the code I’m using:

```
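import numpy as np
import torch

A = np.random.randn(3000, 100)
B = np.random.randn(100, 100)
C = np.random.randn(100, 3_000_000)  # dimensions must be ints; 3e6 is a float
# Move each matrix to the GPU (float64, NumPy's default dtype)
A_gpu = torch.from_numpy(A).cuda()
B_gpu = torch.from_numpy(B).cuda()
C_gpu = torch.from_numpy(C).cuda()
R_gpu = A_gpu @ B_gpu @ C_gpu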
```

I don’t understand why so much GPU RAM is required. Isn’t the multiplication just a tiled CUDA kernel?
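
A rough size estimate may help frame the question. This is only a back-of-the-envelope sketch: it assumes every matrix is dense float64 (8 bytes per element, the default dtype of `np.random.randn`), and `nbytes` is a hypothetical helper name, not a NumPy or PyTorch function. It ignores any allocator overhead or cuBLAS workspace.

```python
# Back-of-the-envelope sizes, assuming dense float64 storage
# (8 bytes per element, the default dtype of np.random.randn).
BYTES_PER_ELEMENT = 8


def nbytes(rows, cols, itemsize=BYTES_PER_ELEMENT):
    """Bytes needed to store a dense rows x cols matrix (hypothetical helper)."""
    return rows * cols * itemsize


# Shapes of the inputs, the intermediate product, and the final result.
# A @ B @ C is evaluated left to right, so A @ B is computed first.
shapes = {
    "A": (3000, 100),
    "B": (100, 100),
    "C": (100, 3_000_000),
    "A @ B": (3000, 100),
    "R = A @ B @ C": (3000, 3_000_000),
}

sizes = {name: nbytes(*shape) for name, shape in shapes.items()}

for name, size in sizes.items():
    print(f"{name:>14}: {size / 1e9:10.4f} GB")
```

If this arithmetic is right, the three inputs total about 2.4 GB, but the dense 3000 x 3,000,000 result alone would need about 72 GB, so the output rather than the kernel itself may be what dominates the memory use.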