I use FSDP to initialize my model. The torch version is 2.7.1.
Here is my code:
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# 2D mesh on 8 GPUs: 1 replicate group x 8 shard ranks (names as I intend them)
device_mesh = init_device_mesh(
    "cuda",
    mesh_shape=(1, 8),
    mesh_dim_names=("replicate", "shard"),
)

# `model` and `device` come from earlier setup in my script
model = FSDP(
    model,
    auto_wrap_policy=size_based_auto_wrap_policy,
    device_id=device,
    # sharding_strategy=ShardingStrategy.FULL_SHARD,
    mixed_precision=MixedPrecision(
        param_dtype=torch.float,
        reduce_dtype=torch.float,
        buffer_dtype=torch.float,
    ),
    sync_module_states=True,
    limit_all_gathers=True,
    use_orig_params=True,
    device_mesh=device_mesh,
)
In my understanding, with the shard dimension set to 8 and the replicate dimension set to 1, the model should be fully sharded. However, this results in a NO_SHARD FSDP configuration, which is confusing.
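For context, here is how I am inspecting the effective strategy (a minimal sketch, assuming each FSDP-wrapped module exposes a sharding_strategy attribute):

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Walk every FSDP-wrapped submodule and print the strategy it actually uses.
for fsdp_module in FSDP.fsdp_modules(model):
    print(type(fsdp_module.module).__name__, fsdp_module.sharding_strategy)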
I also tried swapping the mesh_dim_names from ("replicate", "shard") to ("shard", "replicate"), but again, the model ends up with no sharding. This suggests that mesh_dim_names may not be having the intended effect. It appears that FSDP always treats the first dimension as the sharding dimension and the second as the replication dimension, regardless of the assigned names.
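One alternative I am aware of is FSDP's hybrid-sharding path, where the docs describe passing a tuple of process groups instead of a mesh. A rough sketch of what I mean (assuming the documented tuple ordering of (shard group, replicate group), and reusing the 1x8 mesh from above):

from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

device_mesh = init_device_mesh("cuda", (1, 8), mesh_dim_names=("replicate", "shard"))
shard_pg = device_mesh.get_group(mesh_dim=1)      # the size-8 dimension I want to shard over
replicate_pg = device_mesh.get_group(mesh_dim=0)  # the size-1 dimension I want to replicate over

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
    # Per the docs, the tuple is (sharding group, replication group),
    # so the assignment is explicit rather than positional in the mesh.
    process_group=(shard_pg, replicate_pg),
    use_orig_params=True,
)

Is this the intended way to pin down which group shards and which replicates, or should the named mesh already handle this?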
If I cannot explicitly assign which dimension corresponds to sharding or replication by name, how can I reliably control which dimension is used for what purpose? This becomes even more confusing once other parallelism dimensions such as tensor parallelism (TP) or sequence parallelism (SP) come into play, since I cannot directly assign a size to the TP dimension in the same way.
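To make the question concrete, the pattern I would like to use for TP is slicing named submeshes and handing each one to the relevant API, roughly like this (a sketch only; the (2, 4) shape and the "w1"/"w2" module names in the plan are placeholders):

from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

# 8 GPUs split into 2 data-parallel groups x 4 tensor-parallel ranks.
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

# Apply tensor parallelism on the named "tp" submesh.
model = parallelize_module(
    model,
    mesh_2d["tp"],
    {"w1": ColwiseParallel(), "w2": RowwiseParallel()},
)

# Shard the TP-parallelized model with FSDP over the named "dp" submesh.
model = FSDP(model, device_mesh=mesh_2d["dp"], use_orig_params=True)

Is this the recommended way to tie each parallelism to a named mesh dimension, or does FSDP's positional interpretation of the mesh get in the way here as well?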