I’m trying to understand the example in Getting Started with DeviceMesh — PyTorch Tutorials 2.8.0+cu128 documentation:
from torch.distributed.device_mesh import init_device_mesh
mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("replicate", "shard", "tp"))
# Users can slice child meshes from the parent mesh.
hsdp_mesh = mesh_3d["replicate", "shard"]
tp_mesh = mesh_3d["tp"]
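For context, here is how I currently assume those sub-meshes are meant to be consumed, continuing the snippet above; the toy block and the parallelize plan are my own placeholders (not from the tutorial), and this assumes a torchrun launch with 8 GPUs. Please correct me if the composition itself is already wrong:
import torch.nn as nn
from torch.distributed.fsdp import fully_shard
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel, parallelize_module

# toy 2-layer MLP standing in for a transformer block (placeholder, not from the tutorial)
block = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256)).cuda()

# TP first: split the two Linear weights column-/row-wise across the 2 ranks of the "tp" dim
parallelize_module(block, tp_mesh, {"0": ColwiseParallel(), "2": RowwiseParallel()})

# then HSDP on whatever each TP rank holds: shard over "shard", replicate over "replicate"
fully_shard(block, mesh=hsdp_mesh)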
Doesn’t FSDP already shard the params/optimizer states/gradients across the nodes? What does it mean to do TP when FSDP-style sharding is already applied?
And another question, about Large Scale Transformer model training with Tensor Parallel (TP) — PyTorch Tutorials 2.8.0+cu128 documentation, which suggests using FSDP across nodes and TP within each node:
Doesn’t using FSDP across nodes also auto-shard the params/optimizer states/gradients? Or does this tutorial suggest configuring fully_shard so that it does not shard params/optimizer states/gradients and only keeps the model-replication aspect? In other words, doesn’t FSDP always do some sort of (inefficient) tensor parallelism, in the form of sharding params across ranks?
And how does TP disable FSDP’s default behavior of “gather all shards on every GPU before the layer forward”? How does using TP change FSDP’s default sharding of params between the GPUs, and FSDP’s default communication pattern at every layer forward?
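To make this second set of questions concrete, here is a self-contained sketch of my current mental model for the FSDP-across-nodes / TP-within-node setup from that tutorial; the 2x4 mesh shape, the toy block, and the parallelize plan are just my example, not the tutorial’s code:
import os
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel, parallelize_module

# assumes e.g. `torchrun --nnodes=2 --nproc_per_node=4 script.py` (2 nodes x 4 GPUs each)
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))  # dp across nodes, tp within a node

block = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256)).cuda()  # toy stand-in

parallelize_module(block, mesh_2d["tp"], {"0": ColwiseParallel(), "2": RowwiseParallel()})
fully_shard(block, mesh=mesh_2d["dp"])

# My guess: each weight is now a DTensor laid out over both mesh dims (the TP shard is
# further sharded over "dp"), so FSDP's per-forward all-gather only runs inside the "dp"
# group and reconstructs this rank's TP shard, never the full unsharded weight. Is that right?
print(block[0].weight.placements)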
Often (for simplicity? for brevity?) these tutorials omit the bits that are actually harder to understand…
Thanks!