Implementing a Mixture of Experts layer

davidmrau · September 1, 2019, 6:24am

I have not yet implemented distribution across multiple GPUs. Even on a single GPU, though, the speedup should be significant because of how the sparsely-gated MoE is designed: instead of passing each sample through all experts, the gating mechanism ensures that each sample is passed through only k experts.
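To make the "only k experts per sample" point concrete, here is a minimal, framework-free sketch in plain Python. It is not my actual implementation: the names `top_k_gate`, `CountingExpert`, and `sparse_moe` are made up for illustration, the gate logits are random instead of coming from a learned gating network, and the noise term and load-balancing loss of the sparsely-gated MoE are left out. It only shows the routing idea and counts how many expert calls are made.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(logits, k):
    """Return the indices of the k largest logits and their softmax weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return top, weights


class CountingExpert:
    """Toy expert (hypothetical): scales its input and counts its calls."""

    def __init__(self, scale):
        self.scale = scale
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return [self.scale * v for v in x]


def sparse_moe(batch, gate_logits, experts, k):
    """Route each sample to its top-k experts and mix their outputs."""
    outputs = []
    for x, logits in zip(batch, gate_logits):
        indices, weights = top_k_gate(logits, k)
        y = [0.0] * len(x)
        for i, w in zip(indices, weights):
            out = experts[i](x)  # only the selected experts are evaluated
            y = [a + w * b for a, b in zip(y, out)]
        outputs.append(y)
    return outputs


random.seed(0)
num_experts, k, batch_size, dim = 8, 2, 4, 3
experts = [CountingExpert(scale=float(i + 1)) for i in range(num_experts)]
batch = [[random.random() for _ in range(dim)] for _ in range(batch_size)]
# Assumption: random logits stand in for the output of a learned gate.
gate_logits = [[random.gauss(0.0, 1.0) for _ in range(num_experts)]
               for _ in range(batch_size)]

outputs = sparse_moe(batch, gate_logits, experts, k)
total_calls = sum(e.calls for e in experts)
print(f"expert calls: {total_calls} (dense would be {batch_size * num_experts})")
```

With 8 experts and k = 2, the 4 samples trigger 8 expert evaluations instead of the 32 a dense pass would need. A real implementation would group the samples by expert so that each expert processes its samples as one batch rather than looping sample by sample.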