Hi, I recently noticed this line in the all-reduce and broadcast collectives.
Do we need to make these redundant copies?
They make TP with DTensors slower than TP with plain tensors.
Yes, @agu is correct: functional collectives are out-of-place by design, so they always allocate the output tensor first. If you turn on torch.compile, it can re-inplace the op.
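To make the cost concrete, here is a plain-Python schematic (not the real PyTorch API; all names below are illustrative) of the difference between an out-of-place collective, which must allocate fresh output buffers, and an in-place one, which mutates the caller's buffers. The last stanza shows the "redundant copy" pattern: out-of-place result copied back into the original storage, paying both an allocation and a copy.

```python
# Plain-Python schematic: `world` stands in for the per-rank tensors of a
# process group. None of these functions are real torch.distributed APIs.

def all_reduce_out_of_place(world):
    # Allocate a fresh output buffer per rank (the extra allocation),
    # then write the reduced values into it; inputs are left untouched.
    reduced = [sum(col) for col in zip(*world)]
    return [list(reduced) for _ in world]  # new buffers

def all_reduce_in_place(world):
    # Mutate each rank's existing buffer: no new allocation.
    reduced = [sum(col) for col in zip(*world)]
    for buf in world:
        buf[:] = reduced
    return world

world = [[1, 2], [3, 4]]          # two "ranks", each holding a tensor
out = all_reduce_out_of_place(world)
assert out == [[4, 6], [4, 6]]
assert world == [[1, 2], [3, 4]]  # result lives in newly allocated memory

all_reduce_in_place(world)
assert world == [[4, 6], [4, 6]]  # same buffers, now holding the sum

# The "redundant copy" pattern this thread is about: out-of-place result
# copied back into the original buffers.
world2 = [[1, 2], [3, 4]]
out2 = all_reduce_out_of_place(world2)
for buf, res in zip(world2, out2):
    buf[:] = res                  # extra copy on top of the allocation
assert world2 == [[4, 6], [4, 6]]
```

torch.compile's re-inplacing pass essentially rewrites the last pattern into the in-place variant when it can prove the input buffer is dead afterwards.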
@mayank31398 I wonder how much slowdown you observed?
For training we usually use SequenceParallel by default, and with SequenceParallel the allocation would happen anyway, since the input and output shapes of the allgather/reduce_scatter are different. For the case without SequenceParallel, IIRC we benchmarked e2e training on Llama models and it did not show observable slowdown compared to TP on normal tensors.
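The shape-change point above is why in-place is not even an option on the SequenceParallel path: allgather and reduce_scatter cannot reuse the input buffer because the output is a different size. A small sketch of the shape arithmetic (helper names are illustrative, not a torch API):

```python
# Shape arithmetic only: each rank holds a 1/world_size shard along `dim`.

def all_gather_shape(local_shape, world_size, dim=0):
    # Gathered output is world_size times larger on the sharded dim,
    # so it can never alias the input buffer.
    out = list(local_shape)
    out[dim] *= world_size
    return tuple(out)

def reduce_scatter_shape(local_shape, world_size, dim=0):
    # Output is world_size times smaller on the scattered dim.
    out = list(local_shape)
    assert out[dim] % world_size == 0
    out[dim] //= world_size
    return tuple(out)

# seq_len 2048 sharded over 4 ranks: each rank holds 512 rows locally.
assert all_gather_shape((512, 4096), world_size=4) == (2048, 4096)
assert reduce_scatter_shape((2048, 4096), world_size=4) == (512, 4096)
```

By contrast, all-reduce and broadcast preserve shape, which is why the extra allocation there is pure overhead rather than a necessity.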
@wanchaol there is a noticeable difference when not using compile.
Is there no plan to move to in-place?
The copy is redundant, and honestly compile doesn't work with everything, so relying on out-of-place here causes a lot of problems.