Torch.distributed.all_reduce causes memory thrashing

I’ve noticed a lot of CUDA memory frees when profiling my code. Memory profiling suggests that the output of all_reduce doesn’t get reused by the caching allocator; instead it accumulates and is then freed in a burst.

I tested this, and disabling the all_reduce after my linear layer fixes the issue.

Tested on torch 2.4.0 and 2.5.1
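
Here is a minimal sketch of the kind of setup I mean (a hypothetical repro, not my actual code): a linear layer whose output is all-reduced every step, with the caching allocator’s memory history recorded so the deferred frees show up in the snapshot. Shapes, iteration count, and file names are placeholders; it assumes a torchrun-style launch with NCCL.

```python
import os
import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    layer = torch.nn.Linear(4096, 4096, device="cuda")
    x = torch.randn(64, 4096, device="cuda")

    # Record allocator events so the accumulate-then-free pattern is visible.
    torch.cuda.memory._record_memory_history()

    for _ in range(100):
        out = layer(x)
        # The output buffer is handed to the NCCL stream here; with the default
        # recordStream-based bookkeeping, the allocator delays reusing it until
        # the collective's stream work is known to be complete, so these buffers
        # pile up and are later freed together.
        dist.all_reduce(out)

    torch.cuda.synchronize()
    torch.cuda.memory._dump_snapshot(f"allreduce_mem_rank{local_rank}.pickle")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```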

Okay, using

TORCH_NCCL_AVOID_RECORD_STREAMS=1

fixed it. Shouldn’t it be the default? Referring to CUDA allocation lifetime for inputs to distributed.all_reduce.
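
For reference, a sketch of how I applied it (setting it in the launch environment works just as well): the variable has to be set before the NCCL process group is created, so it must come before init_process_group.

```python
import os

# Must be set before the NCCL process group is initialized.
os.environ["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1"

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
# ... same model and all_reduce calls as before; collective inputs/outputs are
# now tracked without recordStream, so the output buffers get reused promptly
# instead of accumulating and being freed in a burst.
```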

Yes, this env var is the correct fix. I think we want it to be the default, but it’s a major change and requires someone to drive the flip.
