Torch.distributed.all_reduce causes memory thrashing

Okay using

TORCH_NCCL_AVOID_RECORD_STREAMS=1

fixed it, shouldn’t it be the default? Referring to the CUDA allocation lifetime of inputs to distributed.all_reduce.
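For context, a minimal repro-style sketch of the pattern I mean (assuming a multi-GPU node launched with torchrun; the script name, loop count, and tensor size are illustrative). The env var has to be set before the NCCL process group is created, i.e. in the launch command, not inside the script:

```python
# Launch (assumed 2-GPU example):
#   TORCH_NCCL_AVOID_RECORD_STREAMS=1 torchrun --nproc_per_node=2 all_reduce_demo.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Without TORCH_NCCL_AVOID_RECORD_STREAMS=1, the caching allocator keeps each
    # input buffer alive (via recordStream) until the NCCL stream is done with it,
    # which delays block reuse and can inflate reserved memory when all_reduce is
    # called in a loop with freshly allocated tensors.
    for _ in range(100):
        t = torch.randn(16 * 1024 * 1024, device="cuda")  # ~64 MB per iteration
        dist.all_reduce(t)

    torch.cuda.synchronize()
    if rank == 0:
        print(f"reserved: {torch.cuda.memory_reserved() / 2**20:.1f} MiB")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```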