Torch.distributed.all_reduce causes memory thrashing

Okay using

TORCH_NCCL_AVOID_RECORD_STREAMS=1

fixed it, shouldn’t it be the default? Referring to the CUDA allocation lifetime of inputs to distributed.all_reduce.
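For context, a minimal repro-style sketch of the pattern I mean (assuming a multi-GPU node launched with torchrun; the script name, loop count, and tensor size are illustrative). The env var has to be set before the NCCL process group is created, i.e. in the launch command, not inside the script:

```python
# Launch (assumed 2-GPU example):
#   TORCH_NCCL_AVOID_RECORD_STREAMS=1 torchrun --nproc_per_node=2 all_reduce_demo.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Without TORCH_NCCL_AVOID_RECORD_STREAMS=1, the caching allocator keeps each
    # input buffer alive (via recordStream) until the NCCL stream is done with it,
    # which delays block reuse and can inflate reserved memory when all_reduce is
    # called in a loop with freshly allocated tensors.
    for _ in range(100):
        t = torch.randn(16 * 1024 * 1024, device="cuda")  # ~64 MB per iteration
        dist.all_reduce(t)

    torch.cuda.synchronize()
    if rank == 0:
        print(f"reserved: {torch.cuda.memory_reserved() / 2**20:.1f} MiB")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```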