When I use FSDP in PyTorch, I print each rank's parameters and count them, and I find that different ranks have different numbers of parameters.
Some ranks hold more sharded parameters than others. Why does PyTorch design it this way?
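For context, this is a minimal sketch of roughly how I count the per-rank parameters. The toy model, file name, and nccl backend are just for illustration; my real model is larger, but the counting is the same.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Toy model just to demonstrate the counting.
    model = nn.Sequential(
        nn.Linear(1024, 4096),
        nn.ReLU(),
        nn.Linear(4096, 1024),
    ).cuda()

    fsdp_model = FSDP(model)

    # Count the parameter elements this rank actually holds after sharding.
    local_numel = sum(p.numel() for p in fsdp_model.parameters())
    print(f"rank {rank}: {local_numel} parameters")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=2 count_params.py`, and the printed totals are not the same on every rank.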