I have 3 matrices I need to multiply,
A: 3000×100, B: 100×100, C: 100×3,000,000. C is very large, but moving all three of these matrices to the GPU takes only slightly over 2 GB. Here's the code I'm using:
```python
import numpy as np
import torch

A = np.random.randn(3000, 100)
B = np.random.randn(100, 100)
C = np.random.randn(100, 3_000_000)  # randn needs an int, not 3e6

A_gpu = torch.from_numpy(A).cuda()
B_gpu = torch.from_numpy(B).cuda()
C_gpu = torch.from_numpy(C).cuda()

R_gpu = A_gpu @ B_gpu @ C_gpu
```
I don't understand why so much GPU RAM is required. Isn't the multiplication just some tiled CUDA kernel?
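For reference, here's how I'm estimating the footprint of the three inputs (assuming NumPy's default float64, i.e. 8 bytes per element, which is what `randn` produces):

```python
import numpy as np

# Per-operand sizes in GiB at 8 bytes/element (float64)
shapes = {"A": (3000, 100), "B": (100, 100), "C": (100, 3_000_000)}
for name, shape in shapes.items():
    gib = np.prod(shape) * 8 / 2**30
    print(f"{name}: {shape} -> {gib:.4f} GiB")

total = sum(np.prod(s) * 8 for s in shapes.values()) / 2**30
print(f"total inputs: {total:.4f} GiB")  # roughly 2.24 GiB
```

So the three inputs together come to about 2.24 GiB, which matches the "slightly over 2 GB" figure above.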