Does save_on_cpu() overlap computation and communication?

Hi, thank you for the tag!

I vaguely remember that using non_blocking=True when copying from GPU to CPU can be dangerous (see "Should we set non_blocking to True?", post #18 by sbelharbi), so we only use it when copying from CPU to GPU.
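To make the asymmetry concrete, here is a minimal sketch of a pair of saved-tensor hooks that follows the same pattern: a blocking copy on the way to the CPU, and non_blocking=True only on the way back. This is an illustration, not the actual save_on_cpu() source; the names pack_to_cpu and unpack_from_cpu are just for this example, and it falls back to the CPU when no GPU is available so it can run anywhere.

```python
import torch
from torch.autograd.graph import saved_tensors_hooks


def pack_to_cpu(tensor):
    # GPU -> CPU: plain blocking copy, so the host data is valid
    # as soon as this call returns.
    return tensor.device, tensor.to("cpu")


def unpack_from_cpu(packed):
    device, cpu_tensor = packed
    # CPU -> GPU: request a non-blocking copy. As far as I know this is
    # only truly asynchronous when the source lives in pinned memory,
    # which this sketch does not allocate.
    return cpu_tensor.to(device, non_blocking=True)


# Assumption: use the GPU if present, otherwise run everything on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, 4, device=device, requires_grad=True)

with saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    y = (x * x).sum()  # x is saved for backward through pack_to_cpu

y.backward()  # unpack_from_cpu brings the saved tensor back

# d/dx sum(x * x) = 2x, so the hooks should not change the gradient.
grad_ok = torch.allclose(x.grad, 2 * x.detach())
```

In practice you would probably just use the built-in context manager, `with torch.autograd.graph.save_on_cpu(pin_memory=True):`, which allocates pinned host memory so that the copy back to the GPU has a chance to be asynchronous.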

For reference, this is the PR that introduced the feature: "Add default hooks to save tensors on CPU" by Varal7 (Pull Request #61928, pytorch/pytorch).