I want to compute `C = torch.mm(A, B)` on the GPU, where `A` is a sparse tensor of size `(10,000,000, 100,000)` and `B` is a dense matrix of size `(100,000, 50)`. However, the operation runs out of GPU memory.
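Roughly, the failing call looks like this (the `nnz` value and the random data are just placeholders to show the shapes and layout; my real matrix has far more nonzeros):

```python
import torch

n_rows, n_cols, m = 10_000_000, 100_000, 50
nnz = 1_000_000  # placeholder nonzero count

indices = torch.stack([
    torch.randint(n_rows, (nnz,)),
    torch.randint(n_cols, (nnz,)),
])
values = torch.randn(nnz)

A = torch.sparse_coo_tensor(indices, values, (n_rows, n_cols)).cuda()
B = torch.randn(n_cols, m, device="cuda")

C = torch.mm(A, B)  # this is where I run out of GPU memory
```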

I am thinking about computing the product between several rows of `A` and `B` at a time (something like `torch.mm(A[start_row:end_row, :], B)`) on the GPU, clearing the GPU memory, and repeating the operation for the next block of rows. After all the blocks are done, I would stack the results to recover the full matrix product.
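For concreteness, here is a rough sketch of what I have in mind (`chunked_sparse_mm` and `chunk_rows` are placeholder names; I'm assuming `A` is a COO tensor kept on the CPU, with one block moved to the GPU at a time, and since slicing a sparse COO tensor by rows doesn't seem to be supported directly, the sketch rebuilds each block from the indices and values):

```python
import torch

def chunked_sparse_mm(A_cpu, B_gpu, chunk_rows):
    """Multiply a large sparse COO matrix by a dense matrix, one row block at a time."""
    A_cpu = A_cpu.coalesce()
    idx, val = A_cpu.indices(), A_cpu.values()
    n_rows, n_cols = A_cpu.shape
    out = torch.empty(n_rows, B_gpu.shape[1])  # accumulate the result on the CPU

    for start in range(0, n_rows, chunk_rows):
        end = min(start + chunk_rows, n_rows)
        # select the nonzeros whose row index falls in [start, end)
        mask = (idx[0] >= start) & (idx[0] < end)
        chunk_idx = idx[:, mask].clone()
        chunk_idx[0] -= start  # shift rows so the block starts at row 0
        chunk = torch.sparse_coo_tensor(
            chunk_idx, val[mask], (end - start, n_cols)
        ).to(B_gpu.device)
        out[start:end] = torch.mm(chunk, B_gpu).cpu()
        del chunk
        torch.cuda.empty_cache()  # release the block's cached memory

    return out

# usage: A on CPU, B on GPU
# C = chunked_sparse_mm(A, B, chunk_rows=1_000_000)
```

With, say, `chunk_rows = 1_000_000`, the peak GPU footprint would be one sparse block plus one `(chunk_rows, 50)` dense block instead of the whole `(10,000,000, 50)` result.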

With this approach, I am concerned about the overhead of transferring each block of rows to the GPU and about the sequential nature of the loop. Is there a more efficient way to do this with built-in functions?

Thanks for any advice!