I want to compute `C = torch.mm(A, B)` on the GPU, where `A` is a sparse tensor of size `(10,000,000, 100,000)` and `B` is a dense matrix of size `(100,000, 50)`. However, the operation runs out of GPU memory.
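Roughly, the failing call looks like this (the `nnz` value and the random data are just placeholders to show the shapes and layout; my real matrix has far more nonzeros):

```python
import torch

n_rows, n_cols, m = 10_000_000, 100_000, 50
nnz = 1_000_000  # placeholder nonzero count

indices = torch.stack([
    torch.randint(n_rows, (nnz,)),
    torch.randint(n_cols, (nnz,)),
])
values = torch.randn(nnz)

A = torch.sparse_coo_tensor(indices, values, (n_rows, n_cols)).cuda()
B = torch.randn(n_cols, m, device="cuda")

C = torch.mm(A, B)  # this is where I run out of GPU memory
```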

I am thinking about computing the product between several rows of `A` and `B` at a time (something like `torch.mm(A[start_row:end_row, :], B)`) on the GPU, clearing the GPU memory, and repeating the operation for the next block of rows. After all the blocks are done, I would stack the results to recover the full matrix product.
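For concreteness, here is a rough sketch of what I have in mind (`chunked_sparse_mm` and `chunk_rows` are placeholder names; I'm assuming `A` is a COO tensor kept on the CPU, with one block moved to the GPU at a time, and since slicing a sparse COO tensor by rows doesn't seem to be supported directly, the sketch rebuilds each block from the indices and values):

```python
import torch

def chunked_sparse_mm(A_cpu, B_gpu, chunk_rows):
    """Multiply a large sparse COO matrix by a dense matrix, one row block at a time."""
    A_cpu = A_cpu.coalesce()
    idx, val = A_cpu.indices(), A_cpu.values()
    n_rows, n_cols = A_cpu.shape
    out = torch.empty(n_rows, B_gpu.shape[1])  # accumulate the result on the CPU

    for start in range(0, n_rows, chunk_rows):
        end = min(start + chunk_rows, n_rows)
        # select the nonzeros whose row index falls in [start, end)
        mask = (idx[0] >= start) & (idx[0] < end)
        chunk_idx = idx[:, mask].clone()
        chunk_idx[0] -= start  # shift rows so the block starts at row 0
        chunk = torch.sparse_coo_tensor(
            chunk_idx, val[mask], (end - start, n_cols)
        ).to(B_gpu.device)
        out[start:end] = torch.mm(chunk, B_gpu).cpu()
        del chunk
        torch.cuda.empty_cache()  # release the block's cached memory

    return out

# usage: A on CPU, B on GPU
# C = chunked_sparse_mm(A, B, chunk_rows=1_000_000)
```

With, say, `chunk_rows = 1_000_000`, the peak GPU footprint would be one sparse block plus one `(chunk_rows, 50)` dense block instead of the whole `(10,000,000, 50)` result.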

With this approach, I am concerned about the overhead of transferring each block of rows to the GPU and about the sequential nature of the loop. Is there a more efficient way to do this with built-in functions?

Thanks for any advice!