In the PyTorch FSDP paper, Section 7.2.1, the authors say:
> FSDP cannot ensure that it always achieves the same mathematical equivalence as local training, especially with respect to the optimizer computation. This stems from the fact that the optimizer step operates on the sharded parameters, whose data layout is a function of FSDP’s FlatParameter sharding algorithm that does not respect individual parameter boundaries. As a result, any optimizer computation that depends on an original parameter’s unsharded value (e.g. vector norm), its tensor structure (e.g. approximate second-order optimizers), or require global states over all parameters will become invalid.
I have two questions:

- How does the FlatParameter sharding algorithm not respect individual parameter boundaries? (See the sketch below of how I currently picture it.)
- What is the implication of the last sentence? What kinds of models cannot be trained with FSDP?
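For context on the first question, here is my rough mental model of the flattening and sharding, written as a toy snippet. This is not FSDP's actual implementation, just how I understand the paper: the parameters are flattened and concatenated into one FlatParameter, which is padded and split into equal contiguous chunks, one per rank. Please correct me if this picture is wrong.

```python
import torch
import torch.nn.functional as F

# Toy illustration (not FSDP's real code): two parameters of a layer.
weight = torch.randn(4, 3)   # 12 elements
bias = torch.randn(5)        # 5 elements

# Flatten and concatenate into a single "FlatParameter" (17 elements).
flat = torch.cat([weight.flatten(), bias.flatten()])

world_size = 2
# Pad so the flat tensor divides evenly, then give each rank one contiguous chunk.
padded_numel = ((flat.numel() + world_size - 1) // world_size) * world_size
flat = F.pad(flat, (0, padded_numel - flat.numel()))
shards = flat.chunk(world_size)

# Rank 0 holds elements [0, 9): a piece of `weight` only.
# Rank 1 holds elements [9, 18): the rest of `weight`, all of `bias`, plus padding.
# Neither shard lines up with a parameter boundary, so (as I read the paper)
# something like a per-parameter vector norm cannot be computed locally on one rank.
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard.numel()} elements")
```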