One question: why doesn't FSDP shard all parameters evenly across ranks?

When I use FSDP in PyTorch, I print each rank's parameters and count their elements, and I find that different ranks hold different numbers of parameters. I did this roughly as in the sketch below.
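This is a minimal sketch of how one might reproduce the observation, not my exact script: a made-up toy model wrapped with FSDP (assuming `use_orig_params=True` so the per-rank view of each original parameter can be inspected), launched with e.g. `torchrun --nproc_per_node=2 check_shards.py`.

```python
# Sketch only: toy model and sizes are made up; assumes a single node with 2+ GPUs.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Hypothetical toy model; the real per-rank counts depend on the model and world size.
    model = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 10)).cuda()
    model = FSDP(model, use_orig_params=True)

    # Each rank only materializes its own slice of the flattened parameters,
    # so the locally held element count can differ from rank to rank.
    for name, p in model.named_parameters():
        print(f"rank {rank}: {name}: local numel = {p.numel()}")
    total = sum(p.numel() for p in model.parameters())
    print(f"rank {rank}: total locally held parameter elements = {total}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```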

[screenshot: printed per-rank parameter counts, which differ between ranks]

Some ranks hold more sharded parameters than others. Why is FSDP designed this way in PyTorch?