Can we overlap compute operations with memory operations without pinned memory on the CPU?

Hi, I'm trying to overlap computation and memory operations with the HuggingFace SwitchTransformer.

Here’s a detailed explanation.

  • The memory operation is data movement from CPU to GPU, and its size is 4 MB per block.
  • The number of blocks is variable (typically 2 to 6 in total).
  • The computation operation consists of several very small operations such as GEMMs, each of which takes tens to hundreds of microseconds.
  • I tried to use CUDA streams: I created two different CUDA streams and pushed the memory operations and the computation operations onto each of them (a rough sketch of this setup follows the list).
  • But they did not overlap.
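For context, here is a minimal sketch of what that setup looks like. This is a simplified stand-in rather than my actual application code; the tensor shapes, block count, and iteration count are placeholders chosen to roughly match the numbers above.

```python
import torch

# One stream for CPU->GPU copies, one for compute (simplified stand-in).
copy_stream = torch.cuda.Stream()
compute_stream = torch.cuda.Stream()

# ~4 MB fp32 blocks on the CPU; pageable by default (no pin_memory()).
cpu_blocks = [torch.randn(1024, 1024) for _ in range(4)]

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

gpu_blocks = []
with torch.cuda.stream(copy_stream):
    for blk in cpu_blocks:
        # non_blocking=True is only truly asynchronous when the source is
        # pinned; from pageable memory the copy may serialize with other work.
        gpu_blocks.append(blk.to("cuda", non_blocking=True))

with torch.cuda.stream(compute_stream):
    for _ in range(100):
        c = a @ b  # many small GEMMs, each tens to hundreds of microseconds

torch.cuda.synchronize()
```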

And here are my questions.

  1. First, I have learned that to overlap a memory operation (CPU->GPU) with a computation operation, the memory on the CPU side should be pinned. But in my case, as can be seen in the figure, it is pageable memory, not pinned. Is this the reason they cannot be overlapped?

  2. Second, I ran an experiment to check this with a simple example (overlapping a GEMM with a CPU->GPU memory copy, using both pageable and pinned source memory; a sketch of the experiment is included after the output below), and here is the output.


    [Figure: pageable memory case]

    [Figure: pinned memory case]

It seems that transfers from pageable memory can also be overlapped with compute.
Then, what is the reason that the operations in my application do not overlap?
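For reference, the simple experiment in question 2 was roughly along the lines of the sketch below. This is a simplified reconstruction, not the exact code I ran; the buffer size, GEMM shape, and iteration count are placeholders. It issues the CPU->GPU copy on one stream and GEMMs on another, once with a pageable source tensor and once with a pinned one (via `pin_memory()`).

```python
import time
import torch

def run(pinned: bool) -> None:
    # Host buffer to transfer; pin_memory() returns a page-locked copy.
    src = torch.randn(64, 1024, 1024)  # ~256 MB, fp32
    if pinned:
        src = src.pin_memory()

    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")

    copy_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.Stream()

    torch.cuda.synchronize()
    t0 = time.time()

    with torch.cuda.stream(copy_stream):
        dst = src.to("cuda", non_blocking=True)  # CPU->GPU copy

    with torch.cuda.stream(compute_stream):
        for _ in range(50):
            c = a @ b  # GEMMs issued on a separate stream

    torch.cuda.synchronize()
    label = "pinned" if pinned else "pageable"
    print(f"{label:>8}: {time.time() - t0:.4f} s")

run(pinned=False)  # pageable-memory case
run(pinned=True)   # pinned-memory case
```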

I am also very interested in this question, but I have not found an answer. Have you made any progress?