Sequence Parallel examples

Hi, this link provides an example of tensor parallel combined with SP on the norm layers and FSDP:

However, that is mainly Megatron-style TP. I want to know how to write a simple sequence-parallel version that shards only along the sequence dim with the same Llama model (not the other, much simpler SP example), like what happens in DeepSpeed-Ulysses. Could anyone provide the minimal code changes, e.g. the specific kind of parallelize_plan? I would appreciate it a lot.
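To make the ask concrete, here is a rough sketch of the sharding I have in mind (only a sketch; the mesh size and tensor shapes are placeholder assumptions): activations sharded on the sequence dim, weights left whole.

```python
import torch
from torch.distributed._tensor import Shard, distribute_tensor
from torch.distributed.device_mesh import init_device_mesh

# 1-D mesh over 4 ranks; newer PyTorch exposes the same APIs under
# torch.distributed.tensor instead of torch.distributed._tensor.
mesh = init_device_mesh("cuda", (4,))

# Activations [batch, seq, hidden]: shard the sequence dim (dim 1),
# so each rank holds seq/4 tokens while the model weights stay whole.
tokens = distribute_tensor(torch.randn(8, 2048, 4096), mesh, [Shard(1)])
```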

Good question. We do not have an out-of-the-box example for SP-only. We have a tutorial for SP, but there SP always sits on top of TP: Large Scale Transformer model training with Tensor Parallel (TP) — PyTorch Tutorials 2.4.0+cu121 documentation
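For reference, the per-block plan in that tutorial looks roughly like this (paraphrased from memory; see the tutorial for the exact code). Note that SequenceParallel only wraps the norm layers, while the attention/FFN linears are still Megatron-style TP:

```python
from torch.distributed._tensor import Replicate, Shard
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    PrepareModuleInput,
    RowwiseParallel,
    SequenceParallel,
)

# SP covers only the norms; the linear layers remain sharded Megatron-style.
layer_tp_plan = {
    "attention_norm": SequenceParallel(),
    "attention": PrepareModuleInput(
        input_layouts=(Shard(1), None),             # activations arrive seq-sharded
        desired_input_layouts=(Replicate(), None),  # all-gather before attention
    ),
    "attention.wq": ColwiseParallel(),
    "attention.wk": ColwiseParallel(),
    "attention.wv": ColwiseParallel(),
    "attention.wo": RowwiseParallel(output_layouts=Shard(1)),  # back to seq shard
    "ffn_norm": SequenceParallel(),
    "feed_forward": PrepareModuleInput(
        input_layouts=(Shard(1),),
        desired_input_layouts=(Replicate(),),
    ),
    "feed_forward.w1": ColwiseParallel(),
    "feed_forward.w2": RowwiseParallel(output_layouts=Shard(1)),
    "feed_forward.w3": ColwiseParallel(),
}
```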

Thanks, I’ve seen this doc and it’s the same as the link I provided. After a day, I realized sequence-only parallelism might be doable, but it requires converting the model weights to DTensor. However, SP-only by itself is not very useful; a Ulysses-like parallelism is what’s meaningful.
So let me ask in more detail: since Ulysses implements SP with an all-to-all transpose plus attention-head parallelism similar to TP, my question is: can we do this with DTensor today? More concretely, can DTensor automatically issue all-to-all communication? See the sketch below for what I mean.
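Here is a minimal sketch of the Ulysses step I have in mind (all shapes and the mesh setup are assumptions; whether redistribute actually lowers the shard-to-shard transition to a single all-to-all is exactly what I am asking):

```python
import torch
import torch.distributed as dist
from torch.distributed._tensor import Shard, distribute_tensor
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (dist.get_world_size(),))

# Q/K/V after projection: [batch, heads, seq, head_dim], sharded on seq (dim 2).
q = distribute_tensor(torch.randn(8, 32, 2048, 128), mesh, [Shard(2)])

# Ulysses step 1: before attention, switch from sequence sharding to head
# sharding, so each rank sees the full sequence for heads/world_size heads.
# Question: does DTensor lower this Shard(2) -> Shard(1) transition to a
# single all-to-all?
q_heads = q.redistribute(mesh, [Shard(1)])

# ... attention over the full sequence on the local heads ...

# Ulysses step 2: redistribute back to sequence sharding after attention.
out_seq = q_heads.redistribute(mesh, [Shard(2)])
```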