Does save_on_cpu() overlap computation and communication?

I’m thinking of using torch.autograd.graph.save_on_cpu() to handle some offloading, but I’m curious whether the saved_tensors_hooks mechanism overlaps computation and communication during execution, i.e. whether communication operations such as offloading can run without blocking computation. This would be helpful, as the code using the context manager is more concise than the CUDA stream version.
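
For context, here is a minimal usage sketch of the context manager in question (the shapes and the model are just placeholders):

```python
import torch

# Inside the context manager, tensors saved for backward are moved to CPU
# (optionally into pinned memory) during the forward pass and copied back
# to the GPU on demand during the backward pass.
x = torch.randn(1024, 1024, device="cuda", requires_grad=True)
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    y = (x @ x).relu().sum()  # activations saved for backward live on CPU
y.backward()
```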

I haven’t profiled this context manager, but based on this code the packed tensor uses pinned memory, so the copy should be able to run without blocking the host.
However, the copy_ call seems to be missing the non_blocking=True argument, so I’m unsure whether that’s intentional or an oversight.
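
For illustration, here is a hedged sketch of what pack/unpack hooks in the spirit of save_on_cpu’s pinned-memory path could look like with the non_blocking=True variant being discussed. This is not PyTorch’s actual implementation, and its safety assumes the forward and backward copies are issued on the same CUDA stream:

```python
import torch
from torch.autograd.graph import saved_tensors_hooks

def pack_to_cpu(tensor):
    if not tensor.is_cuda:
        return (tensor.device, tensor)
    packed = torch.empty(
        tensor.size(), dtype=tensor.dtype, layout=tensor.layout,
        pin_memory=True,  # pinned memory enables asynchronous DMA transfers
    )
    # Async device-to-host copy; safe only if nothing on the host reads
    # `packed` before the stream-ordered copy back in unpack.
    packed.copy_(tensor, non_blocking=True)
    return (tensor.device, packed)

def unpack_from_cpu(packed):
    device, tensor = packed
    # Async host-to-device copy; same-stream ordering ensures it observes
    # the completed device-to-host write from pack.
    return tensor.to(device, non_blocking=True)

x = torch.randn(512, 512, device="cuda", requires_grad=True)
with saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    y = (x * x).sum()
y.backward()
```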

CC @Varal7, as you seem to be the author: did you intentionally omit the non_blocking=True argument in the copy_, or was it overlooked?

Thanks a lot for your timely reply! Looking forward to the author’s response!

Hi, thank you for the tag!

I vaguely remember that using non_blocking=True when copying from GPU to CPU might be dangerous (Should we set non_blocking to True? - #18 by sbelharbi), so we only use it when copying from CPU to GPU.
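
To make the hazard concrete, here is a minimal sketch (assumed behavior, worth verifying on your setup) of why a GPU-to-CPU non_blocking copy needs an explicit synchronization before the host reads the result:

```python
import torch

src = torch.randn(4096, 4096, device="cuda")
dst = torch.empty(src.size(), dtype=src.dtype, pin_memory=True)
dst.copy_(src, non_blocking=True)   # returns immediately on the host
torch.cuda.synchronize()            # without this, reading dst is a race
print(dst.sum())                    # safe only after the synchronize
```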

For reference, this is the PR that introduced the feature: Add default hooks to save tensors on CPU by Varal7 · Pull Request #61928 · pytorch/pytorch · GitHub