Sequential batched matmul on GPU

I want to compute the matrix product C = A @ B on the GPU, where A is a sparse tensor of shape (10,000,000, 100,000) and B is a dense matrix of shape (100,000, 50). However, the operation runs out of GPU memory.

I am thinking about computing the product between several rows of A and B (something like A[start_row:end_row, :] @ B) on the GPU at a time, clearing the GPU memory, and repeating the operation on the next few rows. After all chunks are done, I would stack the results to recover the full matrix product.
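For reference, here is a minimal CPU sketch of the chunking idea using dense NumPy arrays as a stand-in (the function name `chunked_matmul` and the chunk size are my own; on the GPU the same loop would slice the sparse tensor, run the framework's sparse matmul on each chunk, and free the chunk's memory before the next iteration):

```python
import numpy as np

def chunked_matmul(A, B, chunk_rows):
    """Compute A @ B in row chunks of A to bound peak memory.

    Only `chunk_rows` rows of the output (plus B) need to be
    resident at once; the partial results are stacked at the end.
    """
    parts = []
    for start in range(0, A.shape[0], chunk_rows):
        end = min(start + chunk_rows, A.shape[0])
        parts.append(A[start:end, :] @ B)  # one chunk of output rows
    return np.vstack(parts)
```

The chunked result is identical to the full product; the trade-off is a Python-level loop whose per-iteration launch and slicing overhead grows as the chunk size shrinks.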

With this approach, I am concerned about the overhead of repeatedly slicing and transferring rows, and about the sequential nature of the loop. Is there a more efficient way to do this operation using built-in functions?

Thanks for any advice!