I suspect the current way of specifying shardings (via mesh and placements) in DTensor is ambiguous. Any way to resolve that?
Example:
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

in_tensor = torch.arange(6)
mesh = init_device_mesh("cuda", (2, 3))  # [[GPU0, GPU1, GPU2], [GPU3, GPU4, GPU5]]
distribute_tensor(in_tensor, mesh, [Shard(0), Shard(0)])
This call doesn't specify whether to split the size-6 dimension across the mesh dimension of size 2 first or across the one of size 3 first. If it splits by 2 first, GPU1 ends up holding element 1; if it splits by 3 first, GPU1 ends up holding element 2.
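To make the two candidate layouts concrete, here is a minimal local sketch using plain torch.chunk (no DTensor involved); the GPU labels assume the row-major mesh layout from the comment above:

import torch

x = torch.arange(6)

# Order A: shard along mesh dim 0 (size 2) first, then mesh dim 1 (size 3)
rows = torch.chunk(x, 2)                     # mesh row 0 -> [0,1,2], row 1 -> [3,4,5]
order_a = [torch.chunk(r, 3) for r in rows]  # GPU0=[0], GPU1=[1], GPU2=[2], GPU3=[3], ...

# Order B: shard along mesh dim 1 (size 3) first, then mesh dim 0 (size 2)
cols = torch.chunk(x, 3)                     # mesh col 0 -> [0,1], col 1 -> [2,3], col 2 -> [4,5]
order_b = [torch.chunk(c, 2) for c in cols]  # GPU0=[0], GPU1=[2], GPU2=[4], GPU3=[1], ...

print(order_a[0][1], order_b[1][0])          # GPU1's shard: tensor([1]) vs tensor([2])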
I guess this is an important distinction for context parallelism. The sequence dimension of the input activation to a transformer layer is both context-parallelized and tensor-parallelized, so whether CP or TP is applied first affects how the data is distributed across GPUs. However, maybe in practice an arbitrary order has been good enough.
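For illustration, here is a small sketch of that CP/TP ordering question; the degrees cp=2 and tp=2 and the 8-token sequence are hypothetical, and this again uses local torch.chunk rather than DTensor:

import torch

seq = torch.arange(8)  # positions of one 8-token sequence
cp, tp = 2, 2          # hypothetical CP and TP degrees

# CP applied first, then TP: index as cp_first[cp_rank][tp_rank]
cp_first = [torch.chunk(chunk, tp) for chunk in torch.chunk(seq, cp)]
# TP applied first, then CP: index as tp_first[tp_rank][cp_rank]
tp_first = [torch.chunk(chunk, cp) for chunk in torch.chunk(seq, tp)]

print(cp_first[0][1])  # rank (cp=0, tp=1): tensor([2, 3])
print(tp_first[1][0])  # rank (cp=0, tp=1): tensor([4, 5])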