How can PyTorch automatically find the required pieces of a CPU-side matrix and feed only those pieces to CUDA?

When using a CUDA device in PyTorch, we normally have to send a whole block of data (such as a matrix) to the GPU. But when a matrix is very large it exceeds the GPU's memory capacity, and we have to fall back on torch.sparse methods to handle it.
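For reference, the sparse workaround I mean looks roughly like this (a minimal sketch; the shapes and values are made up for illustration, and it needs a CUDA device to run):

```python
import torch

# A huge matrix that is mostly zeros can be shipped in COO form,
# so only the nonzero entries travel to the GPU.
indices = torch.tensor([[0, 1, 2], [2, 0, 1]])   # (row, col) of the nonzeros
values = torch.tensor([3.0, 4.0, 5.0])
A_sparse = torch.sparse_coo_tensor(indices, values, size=(3, 3)).cuda()

x = torch.randn(3, 1, device="cuda")
y = torch.sparse.mm(A_sparse, x)                 # sparse-dense matmul on the GPU
```

This works when the matrix is genuinely sparse, but my question is about large dense matrices.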

The idea is this: besides the sparse format, is there any PyTorch functionality that could automatically determine which pieces (rows, columns) of a matrix are needed for the current computation and send only those pieces to CUDA, while the original matrix stays on the CPU side?

For example, suppose we are computing torch.mm(A, B), where both A and B are extremely large and dense. Each time a row of A is multiplied with a column of B, the hoped-for functionality would automatically identify which pieces are currently being used and send just those to CUDA. Once partial results are ready, it would send them back to the CPU and move on to the next pieces that need the CUDA round trip.
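To make the intent concrete, here is a manual sketch of the behavior I am hoping PyTorch could automate. It is not an existing API; the function name tiled_mm and the tile size are my own placeholders, and it assumes each tile-by-tile block fits in GPU memory:

```python
import torch

def tiled_mm(A, B, tile=4096, device="cuda"):
    """Compute A @ B block by block: only small tiles of A and B live
    on the GPU at any moment; the full matrices and the result stay
    on the CPU."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = torch.zeros(m, n, dtype=A.dtype)            # result accumulates on the CPU
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            acc = torch.zeros(min(tile, m - i), min(tile, n - j),
                              dtype=A.dtype, device=device)
            for p in range(0, k, tile):             # walk the shared dimension
                a = A[i:i + tile, p:p + tile].to(device)   # ship one tile of A
                b = B[p:p + tile, j:j + tile].to(device)   # ship one tile of B
                acc += a @ b                        # partial product on the GPU
            C[i:i + tile, j:j + tile] = acc.cpu()   # ship the finished block back
    return C
```

Writing this by hand works, but I am asking whether PyTorch can do this tile selection and transfer for me automatically.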

The overall efficiency might be low because of the frequent CPU-CUDA transfers, but it could still be worthwhile for some very complex operations. So, does PyTorch provide any method that supports this kind of automatic data shuttling?