Okay, using
TORCH_NCCL_AVOID_RECORD_STREAMS=1
fixed it. Shouldn't that be the default? I'm referring to the CUDA allocation lifetime for inputs to distributed.all_reduce.
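For reference, a minimal sketch of the setup I mean, assuming a torchrun launch (so RANK, WORLD_SIZE, and LOCAL_RANK come from the environment); the tensor size and variable names are just for illustration:

```python
import os

# Set before the NCCL process group runs its first collective. With this
# flag, ProcessGroupNCCL avoids calling recordStream on collective inputs
# and instead holds a reference until the collective completes, so the
# caching allocator can reuse the input's memory promptly instead of
# deferring it until the NCCL stream catches up.
os.environ["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1"

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Illustrative input: without the flag, this allocation's lifetime can
    # be extended past the Python reference going away.
    x = torch.ones(1 << 20, device="cuda")
    dist.all_reduce(x)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with e.g. torchrun --nproc_per_node=2 script.py.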