I am wondering how I can implement logic so that results computed on the GPU are written directly to CPU memory, without first being stored in GPU memory and then transferred. I think this may be feasible thanks to DMA, but I am not sure. The motivation is that in my project the activations are large, and I want to keep them in CPU memory to save GPU memory.
I know this question may sound weird, but I really appreciate any help you can provide.
Currently I think the closest thing to this is CPU activation checkpointing, which is implemented by projects such as DeepSpeed: Activation Checkpointing — DeepSpeed 0.12.3 documentation
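For reference, DeepSpeed exposes this through the `activation_checkpointing` section of its config. A sketch based on the documented schema (as I understand it, `cpu_checkpointing` only takes effect when `partition_activations` is also enabled, so treat the exact combination as something to verify against your DeepSpeed version):

```json
{
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": false,
    "synchronize_checkpoint_boundary": false
  }
}
```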
In practice you may find that on many systems device-to-host bandwidth is quite limited compared to GPU memory bandwidth, which can greatly slow down your application. This limitation may ease in the future, though, as higher-bandwidth unified-memory systems become more popular.
Thanks for your reply. I understand that device-to-host bandwidth is slow. However, in my case, availability (i.e., whether training can run at all) is more important than training throughput. When the activations are large (as is common in LLM training), the memory-consumption spike caused by activation generation can lead to OOM in a memory-constrained setting. As such, I want to avoid storing activations on the GPU even temporarily, which makes the activation checkpointing mechanism in DeepSpeed unsuitable. What I expect is that the GPU can write the result tensors directly to CPU memory, without involving GPU memory at all. Is that possible?
I don’t think this is possible, but you could use the aforementioned utilities, or offload the activations to the host after they have been stored temporarily on the GPU, as described here.
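If the offload-after-compute pattern is acceptable, PyTorch ships it natively as `torch.autograd.graph.save_on_cpu`: activations still materialize briefly on the GPU, but every tensor saved for backward is moved to (optionally pinned) host memory right away and copied back on demand during the backward pass. A minimal sketch (the toy model and sizes here are made up for illustration):

```python
import torch
from torch.autograd.graph import save_on_cpu

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Linear(64, 256), torch.nn.ReLU(), torch.nn.Linear(256, 64)
).to(device)
x = torch.randn(8, 64, device=device)

# Inside this context, tensors saved for backward live in host memory;
# pinned host buffers allow the copies to go through DMA.
with save_on_cpu(pin_memory=torch.cuda.is_available()):
    y = model(x)

loss = y.pow(2).mean()
loss.backward()  # saved activations are fetched back from the CPU here
```

Note that `pin_memory=True` requires a CUDA-capable setup, and the hidden cost is exactly the device-to-host bandwidth discussed above, so it trades throughput for peak GPU memory.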