I have not yet implemented distribution across multiple GPUs. On a single GPU, though, the speedup will still be significant because of the architecture of the sparsely-gated MoE: instead of passing each sample through all experts, the gating mechanism ensures that each sample is passed through only k experts.
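To make the routing idea concrete, here is a minimal NumPy sketch of top-k gating. It is not the actual implementation: the names `top_k_gating`, `sparse_moe_forward`, `gate_w`, and `expert_ws` are hypothetical, each expert is assumed to be a single linear layer, and the noise and load-balancing terms used in sparsely-gated MoE layers are left out.

```python
import numpy as np

def top_k_gating(logits, k):
    """Return (indices, weights) of the k highest-scoring experts per sample."""
    # Indices of the k largest logits in each row (order within the k is arbitrary).
    top_idx = np.argpartition(logits, -k, axis=1)[:, -k:]
    top_logits = np.take_along_axis(logits, top_idx, axis=1)
    # Softmax over the selected logits only, so each row's weights sum to 1.
    top_logits = top_logits - top_logits.max(axis=1, keepdims=True)
    weights = np.exp(top_logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return top_idx, weights

def sparse_moe_forward(x, gate_w, expert_ws, k):
    """Pass each sample through only its k selected experts.

    x:         (batch, d_in) inputs
    gate_w:    (d_in, n_experts) gating weights (hypothetical linear gate)
    expert_ws: list of (d_in, d_out) matrices, one linear expert each (assumption)
    Returns the mixed output and the number of per-sample expert evaluations.
    """
    top_idx, weights = top_k_gating(x @ gate_w, k)
    out = np.zeros((x.shape[0], expert_ws[0].shape[1]))
    expert_calls = 0
    for e, w in enumerate(expert_ws):
        # Rows routed to expert e, and the slot (0..k-1) holding their gate weight.
        rows, slots = np.nonzero(top_idx == e)
        if rows.size == 0:
            continue  # no sample selected this expert, so it is skipped entirely
        out[rows] += weights[rows, slots][:, None] * (x[rows] @ w)
        expert_calls += rows.size
    return out, expert_calls
```

With a batch of size B and n experts, a dense mixture performs B·n per-sample expert evaluations, while this sketch performs B·k. How much of that reduction turns into wall-clock speedup on a GPU depends on how the per-expert sub-batches are dispatched, which the sketch does not model.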