How to train a Mixture-of-Experts (MoE) model with Fully Sharded Data Parallel (FSDP)

liuslnlp · November 17, 2023, 2:15pm

I replaced the FFN in a Transformer with an MoE layer (fairscale's implementation), and I am trying to work out how to combine MoE with FSDP. The problem is that different parameters in the model belong to different data-parallel (DP) groups, and as far as I can tell FSDP does not provide an interface for assigning different DP groups to different parameters.

For example, consider the diagram below: the model has four experts, two placed on each rank, and the expert-parallel size is 2. In this case, the DP group for non-expert parameters (such as attention and word embeddings) is [0, 1, 2, 3], while the DP groups for the experts are [0, 2] and [1, 3].
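To make the layout concrete, here is a minimal sketch of how these rank groups could be computed. The function name `build_moe_rank_groups` is hypothetical, and the sketch assumes that rank `r` holds expert shard `r % expert_parallel_size`, which matches the example above but is not necessarily how every MoE setup assigns experts.

```python
def build_moe_rank_groups(world_size, expert_parallel_size):
    """Return (dense_dp_group, expert_dp_groups, expert_parallel_groups).

    Assumption: rank r holds expert shard r % expert_parallel_size.
    """
    if world_size % expert_parallel_size != 0:
        raise ValueError("world_size must be divisible by expert_parallel_size")

    # Non-expert parameters (attention, embeddings) are replicated on every rank.
    dense_dp_group = list(range(world_size))

    # Ranks holding the same expert shard form one expert data-parallel group.
    expert_dp_groups = [
        list(range(offset, world_size, expert_parallel_size))
        for offset in range(expert_parallel_size)
    ]

    # Consecutive ranks that together hold one full copy of all experts.
    expert_parallel_groups = [
        list(range(start, start + expert_parallel_size))
        for start in range(0, world_size, expert_parallel_size)
    ]
    return dense_dp_group, expert_dp_groups, expert_parallel_groups


dense_dp, expert_dp, expert_ep = build_moe_rank_groups(world_size=4, expert_parallel_size=2)
print(dense_dp)   # ranks sharing the non-expert parameters
print(expert_dp)  # ranks sharing each expert shard
```

In a real job, each of these rank lists would presumably be turned into a communicator with `torch.distributed.new_group(ranks=...)`, called on every rank for every group. What remains unclear to me is how to hand the expert groups to FSDP for the expert parameters only.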

Reference:

github.com/facebookresearch/fairscale · [MoE] How to do expert sharding? · gongbudaizhe · opened 09:12AM - 25 Nov 21 UTC · closed 02:23PM - 07 Jan 22 UTC

## ❓ Questions and Help

MoE models have a different expert on each GPU device…. Therefore, these expert parameters should *not* be allreduced during the backward pass. To achieve this, `p.expert` is set to True as shown in: https://github.com/facebookresearch/fairscale/blob/562542478a145fae410cb23ffd46a8d7c2497abb/fairscale/nn/moe/moe_layer.py#L62-L64

My question is, how is `p.expert` used in DDP? Is this attribute automatically handled in ShardedDataParallel and FullyShardedDataParallel to disable allreduce?
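As a rough illustration of what the `expert` flag makes possible, here is a minimal sketch that partitions parameters by that attribute. It does not answer whether ShardedDataParallel or FullyShardedDataParallel honor the flag. The helper `split_by_expert_flag` and the parameter names are hypothetical, and plain `SimpleNamespace` objects stand in for real parameters so the snippet runs without a GPU or a process group.

```python
from types import SimpleNamespace


def split_by_expert_flag(named_params):
    """Split (name, param) pairs into expert and shared parameter names.

    A parameter counts as an expert parameter if it carries a truthy
    `expert` attribute, as set in the linked moe_layer.py lines.
    """
    expert_names, shared_names = [], []
    for name, param in named_params:
        if getattr(param, "expert", False):
            expert_names.append(name)
        else:
            shared_names.append(name)
    return expert_names, shared_names


# Stand-ins for real parameters; the names are made up for illustration.
toy_params = [
    ("embed.weight", SimpleNamespace()),
    ("layers.0.attn.q_proj.weight", SimpleNamespace()),
    ("layers.0.moe.experts.0.fc1.weight", SimpleNamespace(expert=True)),
    ("layers.0.moe.experts.1.fc1.weight", SimpleNamespace(expert=True)),
]

expert_names, shared_names = split_by_expert_flag(toy_params)
print(expert_names)
print(shared_names)
```

With a real model, the same helper could be fed `model.named_parameters()`. The shared parameters would then be synchronized over all ranks, and the expert parameters only within their own expert DP group, or not at all when each expert lives on exactly one rank.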