Can we overlap compute operations with memory operations without pinned memory on the CPU?

Hi, I'm trying to overlap computation and memory operations with the HuggingFace SwitchTransformer.

Here’s a detailed explanation.

  • The memory operation is data movement from CPU to GPU, and its size is 4 MB per block.
  • The number of blocks is variable (typically 2 to 6 in total).
  • The computation operation consists of several very small operations such as GEMMs, each of which takes tens to hundreds of microseconds.
  • I tried to use CUDA streams: I created two different CUDA streams and pushed the memory operations and the computation operations onto each of them (a rough sketch of this setup follows the list).
  • But they did not overlap.
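For context, here is a minimal sketch of what that setup looks like. This is a simplified stand-in rather than my actual application code; the tensor shapes, block count, and iteration count are placeholders chosen to roughly match the numbers above.

```python
import torch

# One stream for CPU->GPU copies, one for compute (simplified stand-in).
copy_stream = torch.cuda.Stream()
compute_stream = torch.cuda.Stream()

# ~4 MB fp32 blocks on the CPU; pageable by default (no pin_memory()).
cpu_blocks = [torch.randn(1024, 1024) for _ in range(4)]

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

gpu_blocks = []
with torch.cuda.stream(copy_stream):
    for blk in cpu_blocks:
        # non_blocking=True is only truly asynchronous when the source is
        # pinned; from pageable memory the copy may serialize with other work.
        gpu_blocks.append(blk.to("cuda", non_blocking=True))

with torch.cuda.stream(compute_stream):
    for _ in range(100):
        c = a @ b  # many small GEMMs, each tens to hundreds of microseconds

torch.cuda.synchronize()
```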

And here are my questions.

  1. First, I have learned that to overlap a memory operation (CPU->GPU) with a computation operation, the memory on the CPU side should be pinned. But in my case, as can be seen in the figure, it is pageable memory, not pinned. Is this the reason they cannot be overlapped?

  2. Second, I ran an experiment to check this with a simple example (overlapping a GEMM with a CPU->GPU memory copy, using both pageable and pinned source memory; a sketch of the experiment is included after the output below), and here is the output.


    [Figure: pageable memory case]

    [Figure: pinned memory case]

It seems that transfers from pageable memory can also be overlapped with compute.
Then, what is the reason that the operations in my application do not overlap?
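For reference, the simple experiment in question 2 was roughly along the lines of the sketch below. This is a simplified reconstruction, not the exact code I ran; the buffer size, GEMM shape, and iteration count are placeholders. It issues the CPU->GPU copy on one stream and GEMMs on another, once with a pageable source tensor and once with a pinned one (via `pin_memory()`).

```python
import time
import torch

def run(pinned: bool) -> None:
    # Host buffer to transfer; pin_memory() returns a page-locked copy.
    src = torch.randn(64, 1024, 1024)  # ~256 MB, fp32
    if pinned:
        src = src.pin_memory()

    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")

    copy_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.Stream()

    torch.cuda.synchronize()
    t0 = time.time()

    with torch.cuda.stream(copy_stream):
        dst = src.to("cuda", non_blocking=True)  # CPU->GPU copy

    with torch.cuda.stream(compute_stream):
        for _ in range(50):
            c = a @ b  # GEMMs issued on a separate stream

    torch.cuda.synchronize()
    label = "pinned" if pinned else "pageable"
    print(f"{label:>8}: {time.time() - t0:.4f} s")

run(pinned=False)  # pageable-memory case
run(pinned=True)   # pinned-memory case
```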

I am also very interested in this question, but I have not found an answer. Have you made any progress?