Sequence Parallel examples

Hi, this link provides an example of tensor parallel combined with SP on the norm layers and FSDP:

However, that is mainly Megatron-style TP. I want to know how to write a simple sequence-parallel version that shards only along the sequence dim with the same Llama model (not the other, much simpler SP example), like what happens in DeepSpeed-Ulysses. Could anyone provide the minimal code changes, e.g. the specific kind of parallelize_plan? I would appreciate it a lot.
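To make the ask concrete, here is a rough sketch of the sharding I have in mind (only a sketch; the mesh size and tensor shapes are placeholder assumptions): activations sharded on the sequence dim, weights left whole.

```python
import torch
from torch.distributed._tensor import Shard, distribute_tensor
from torch.distributed.device_mesh import init_device_mesh

# 1-D mesh over 4 ranks; newer PyTorch exposes the same APIs under
# torch.distributed.tensor instead of torch.distributed._tensor.
mesh = init_device_mesh("cuda", (4,))

# Activations [batch, seq, hidden]: shard the sequence dim (dim 1),
# so each rank holds seq/4 tokens while the model weights stay whole.
tokens = distribute_tensor(torch.randn(8, 2048, 4096), mesh, [Shard(1)])
```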

Good question. We do not have an out-of-the-box example for SP-only. We have a tutorial for SP, but there SP always sits on top of TP: Large Scale Transformer model training with Tensor Parallel (TP) — PyTorch Tutorials 2.4.0+cu121 documentation
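For reference, the per-block plan in that tutorial looks roughly like this (paraphrased from memory; see the tutorial for the exact code). Note that SequenceParallel only wraps the norm layers, while the attention/FFN linears are still Megatron-style TP:

```python
from torch.distributed._tensor import Replicate, Shard
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    PrepareModuleInput,
    RowwiseParallel,
    SequenceParallel,
)

# SP covers only the norms; the linear layers remain sharded Megatron-style.
layer_tp_plan = {
    "attention_norm": SequenceParallel(),
    "attention": PrepareModuleInput(
        input_layouts=(Shard(1), None),             # activations arrive seq-sharded
        desired_input_layouts=(Replicate(), None),  # all-gather before attention
    ),
    "attention.wq": ColwiseParallel(),
    "attention.wk": ColwiseParallel(),
    "attention.wv": ColwiseParallel(),
    "attention.wo": RowwiseParallel(output_layouts=Shard(1)),  # back to seq shard
    "ffn_norm": SequenceParallel(),
    "feed_forward": PrepareModuleInput(
        input_layouts=(Shard(1),),
        desired_input_layouts=(Replicate(),),
    ),
    "feed_forward.w1": ColwiseParallel(),
    "feed_forward.w2": RowwiseParallel(output_layouts=Shard(1)),
    "feed_forward.w3": ColwiseParallel(),
}
```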

Thanks, I’ve seen this doc and it’s the same as the link I provided. After a day, I realized sequence-only parallelism might be doable, but it requires converting the model weights to DTensor. However, SP-only by itself is not very useful; a Ulysses-like parallelism is what’s meaningful.
So let me ask in more detail: since Ulysses implements SP with an all-to-all transpose plus attention-head parallelism similar to TP, my question is: can we do this with DTensor today? More concretely, can DTensor automatically issue all-to-all communication? See the sketch below for what I mean.
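Here is a minimal sketch of the Ulysses step I have in mind (all shapes and the mesh setup are assumptions; whether redistribute actually lowers the shard-to-shard transition to a single all-to-all is exactly what I am asking):

```python
import torch
import torch.distributed as dist
from torch.distributed._tensor import Shard, distribute_tensor
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (dist.get_world_size(),))

# Q/K/V after projection: [batch, heads, seq, head_dim], sharded on seq (dim 2).
q = distribute_tensor(torch.randn(8, 32, 2048, 128), mesh, [Shard(2)])

# Ulysses step 1: before attention, switch from sequence sharding to head
# sharding, so each rank sees the full sequence for heads/world_size heads.
# Question: does DTensor lower this Shard(2) -> Shard(1) transition to a
# single all-to-all?
q_heads = q.redistribute(mesh, [Shard(1)])

# ... attention over the full sequence on the local heads ...

# Ulysses step 2: redistribute back to sequence sharding after attention.
out_seq = q_heads.redistribute(mesh, [Shard(2)])
```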