Transfer data GPU -> CPU and compute on GPU in parallel

mseeger · March 14, 2026, 2:23pm

I am a beginner with PyTorch distributed, but never found this point explained anywhere.

I am trying to implement gradient computation with activation checkpointing, where the model weights are kept on CPU, and only weights of the current layer (shard) are loaded to GPU. One iteration works like this (here, l counts down from the top):

Transfer weights for layer l from CPU to GPU
Run backward for layer l
Write gradients for layer l to CPU

I’d like to do memory transfer and computation in parallel. Using two copies of shards, I’d like to run these in parallel:

Run backward for layer l
Write back gradients for layer l+1 (GPU→CPU) and load weights for layer l-1 (CPU→GPU)

I read about dist.isend and dist.ircv, but this is async between two devices. What I want to do here, should run on a single device.

I also read about adding non_blocking=True to all copy_ and to statements used to transfer between GPU and CPU. I can do that, but I am sceptical as to whether this really works, without me being more explicit about what should be done in parallel when (just like I do with isend and ircv).

In general, async CPU←→ GPU transfer and GPU computation (on a single device) remains a big mistery to me. Thanks for pointing out some docs that explain this.

mseeger · March 14, 2026, 3:27pm

I am (trying to) reading A guide on good usage of non_blocking and pin_memory() in PyTorch — PyTorch Tutorials 2.11.0+cu130 documentation. There, it says that non_blocking=True ensures that CPU→GPU transfers are non-blocking on the host (i.e., on CPU), but in general blocking on the device. There is a complex example given for when this can be avoided.

This is all a bit odd to me. Why can device-to-device transfer run in parallel to computations (on both devices), but transfer CPU-to-GPU cannot?

PingJ · March 16, 2026, 9:50pm

if you are using a single GPU you should generally NOT use torch.distributed package at all.

for this specific use case the non_blocking=True version of torch.to() suffice.

additionally, you may want to use Nsight Systems | NVIDIA Developer or GitHub - XuehaiPan/nvitop: An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management. · GitHub to monitor the system to understand what is happening on the hardware

Best,

Ping

mseeger · March 17, 2026, 12:26pm

Hello Ping, are you sure?

I am using nsys, and there are clearly cudaStreamSynchronize blocks visible which take quite some time.

Also, I am not interested in non-blocking on the host. I am not doing anything on the host. I am interested in CPU–GPU memory transfer not blocking on the device.

The guide I am citing above states that such transfer is typically blocking for the device, but it also cites one way to get around it.

mjoux · March 20, 2026, 8:10am

It’s not entirely clear to me what you mean by “blocking on the device”. On the device, operations are typically stream-ordered, so if your device supports enough physical work queues, and you place operations in separate streams, and there is no contention between resources, nothing will be blocking on the device.
For your specific use case, you need at least 3 streams: one for the actual compute (Forward or backward), one for offloading activations of the previous layer and one for onloading activations for the next layer.

There are several utilities to help you do that: if you are using Nvidia transformer engine (or are able to switch to it), it supports everything you need through get_cpu_offload_context: PyTorch — Transformer Engine 2.12.0 documentation

There seems to be a feature request for this for standard PyTorch FSDP: Feature Request: Activation offloading with async prefetch in FSDP · Issue #174960 · pytorch/pytorch · GitHub

And if you want to roll your own implementation, I used this for a GTC 2024 presentation: llama-recipes/src/llama_recipes/interleaved_offload.py at 60307ec06d7a4edf6d02f402d29c0eb093aac9fa · MatthiasKohl/llama-recipes · GitHub

These implementations might be much more generic and complicated than what you actually need, but they will always be based on some core principle: place computation in one stream, place copies in one direction in a second stream and place copies in the other direction in a third stream. Also note that this kind of offloading doesn’t always make sense at all : it might make more sense to distribute across multiple devices or simply use a device with larger memory if it’s possible.

mseeger · March 20, 2026, 2:28pm

Thanks Matt, this helps me a ton! I am not so versed in these things. “blocking on the device” is something I took from the blog post, where there was also an example, using another stream. I was looking for more input on how this works, and you provided it.

For what I am trying to do, CPU offloading is quite essential. I am trying to compute training gradients for long context models. The gold standard there is RingAttention, which uses several devices along the context dimension. All fine, but then these are not available anymore for larger batches or other nice ideas to use several GPUs. I am not happy with the idea of using many GPUs to enable long contexts, also because many many people just do not have lots of GPUs, or need to work on GPUs with less memory than the very latest (I am one of them!). You might say: Why does this guy want to train long context models with little hardware? Well, fine-tuning and RL to create specialized SLMs is all the rage, and we want to have long contexts for these as well.

mjoux · March 24, 2026, 1:36pm

In general, fine-tuning often comes up as a use-case where CPU offloading can be beneficial, and fine-tuning is used quite a lot by large companies with access to many GPU resources as well.
So what you describe sounds like a reasonable use-case for offloading to me.
If you want to implement this efficiently, I think your best bet is to try and use Transformer Engine or take a look at my custom implementation (linked above): even though this custom implementation might not work perfectly with the latest PyTorch anymore, if you give this to any recent coding AI model, it will probably quickly give you a robust implementation for latest PyTorch.