I’m working on TP (tensor parallel) training for a large model. The prep part of the model runs fast and contains only a small number of parameters, so I don’t want to write parallel plans for those layers, to reduce maintenance cost.
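Roughly, my setup looks like the toy sketch below (module names, sizes, and the mesh shape are placeholders, not my real model; run under torchrun with a matching world size):

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

class ToyModel(nn.Module):
    def __init__(self, dim=1024, n_blocks=2):
        super().__init__()
        # small "prep" part: no TP plan, its parameters stay plain torch.Tensor
        self.prep = nn.Linear(dim, dim)
        # transformer-ish blocks: only these get TP plans
        self.blocks = nn.ModuleList(
            nn.ModuleDict({"w1": nn.Linear(dim, 4 * dim), "w2": nn.Linear(4 * dim, dim)})
            for _ in range(n_blocks)
        )

    def forward(self, x):
        x = self.prep(x)
        for blk in self.blocks:
            x = blk["w2"](blk["w1"](x).relu())
        return x

# placeholder mesh: adjust the TP size to your world size
tp_mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("tp",))
model = ToyModel().cuda()

# only the blocks are parallelized; `prep` is intentionally skipped,
# so the model ends up with a mix of DTensor and plain Tensor parameters
for blk in model.blocks:
    parallelize_module(
        blk, tp_mesh, {"w1": ColwiseParallel(), "w2": RowwiseParallel()}
    )
```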
- If a linear layer has no parallel plan (it is not inside the transformer blocks) and receives a replicated input, each device in the same TP group computes the same weight gradient. Does that gradient get all-reduced across the TP group?
- If a linear layer has no parallel plan and receives a sharded input (a local tensor), each device in the same TP group computes a different weight gradient. Does that gradient get all-reduced across the TP group?
- If an RMSNorm layer has a SequenceParallel plan and receives a sharded input, its weight becomes Replicate. When does its weight gradient get all-reduced across the TP group?
- The optimizer raises a mixed DTensor/Tensor error when the model contains both local tensors (layers with no parallel plan) and DTensors. I see two options: (1) create two param groups, grouping parameters by tensor type (see the sketch after this list), or (2) write parallel plans for all layers. Which one is better in practice?
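For reference, this is roughly what I mean by option 1 (a sketch only; the learning rate is a placeholder, and on older PyTorch versions DTensor lives under torch.distributed._tensor instead):

```python
import torch
from torch.distributed.tensor import DTensor  # torch.distributed._tensor on older versions

# split parameters purely by whether they were sharded into DTensors or not
dtensor_params = [p for p in model.parameters() if isinstance(p, DTensor)]
plain_params = [p for p in model.parameters() if not isinstance(p, DTensor)]

optimizer = torch.optim.AdamW(
    [{"params": dtensor_params}, {"params": plain_params}],
    lr=3e-4,  # placeholder hyperparameter
)
```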