I am trying to implement a mixture-of-experts (MoE) layer, similar to the one described in:

https://arxiv.org/abs/1701.06538

and already discussed in this thread. While reading some threads on the topic, I found the following sentence:

“The MoE (Mixture of Experts Layer) is trained using back-propagation. The Gating Network outputs an (artificially made) sparse vector that acts as a chooser of which experts to consult. More than one expert can be consulted at once.”
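To make the "sparse vector" in that quote concrete, here is a minimal sketch of top-k gating as I understand it. The function name `top_k_gates` and the example logits are my own and hypothetical. I left out the noise term, so this is not the full noisy gating from the paper.

```python
import torch
import torch.nn.functional as F


def top_k_gates(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Return a [batch, num_experts] gate matrix with exactly k non-zeros per row."""
    # Keep the k largest logits per row and remember which experts they belong to.
    top_vals, top_idx = logits.topk(k, dim=1)
    # Normalise only over the selected experts so each row sums to 1.
    top_gates = F.softmax(top_vals, dim=1)
    # All other experts get a gate of exactly zero, i.e. they are not consulted.
    gates = torch.zeros_like(logits)
    return gates.scatter(1, top_idx, top_gates)


# Hypothetical gating logits for one input and four experts.
logits = torch.tensor([[2.0, 0.5, 1.0, -1.0]])
gates = top_k_gates(logits, k=2)
```

With `k=2` only experts 0 and 2 get a non-zero weight here, which matches the statement that more than one expert can be consulted at once.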

I am not sure whether the experts here are pre-trained or not, or whether training involves just the gating network or the full layer (gating network plus experts). If anybody is familiar with this model, please explain this to me if possible.

In any case, I have built three neural networks (`model1`, `model2` and `model3`) that I have already trained and tuned, and I want to include them in the MoE layer to improve the overall accuracy.

The code has the following class

```
import torch
import torch.nn as nn
from torch.distributions.normal import Normal


class MoE(nn.Module):
    """Sparsely gated mixture-of-experts layer with 1-layer feed-forward networks as experts.

    Args:
        input_size: integer - size of the input
        output_size: integer - size of the output
        num_experts: an integer - number of experts
        hidden_size: an integer - hidden size of the experts
        noisy_gating: a boolean
        k: an integer - how many experts to use for each batch element
    """

    def __init__(self, input_size, output_size, num_experts, hidden_size, noisy_gating=True, k=4):
        super(MoE, self).__init__()
        self.noisy_gating = noisy_gating
        self.num_experts = num_experts
        self.output_size = output_size
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.k = k
        # instantiate experts (MLP is not shown here; it is defined elsewhere in the code)
        self.experts = nn.ModuleList([MLP(self.input_size, self.output_size, self.hidden_size) for _ in range(self.num_experts)])
        # gating and noise weights, one column per expert
        self.w_gate = nn.Parameter(torch.zeros(input_size, num_experts), requires_grad=True)
        self.w_noise = nn.Parameter(torch.zeros(input_size, num_experts), requires_grad=True)
        self.softplus = nn.Softplus()
        self.softmax = nn.Softmax(1)
        self.normal = Normal(torch.tensor([0.0]), torch.tensor([1.0]))
        # cannot select more experts than exist
        assert self.k <= self.num_experts
```

I replaced the line

```
self.experts = nn.ModuleList([MLP(self.input_size, self.output_size, self.hidden_size) for i in range(self.num_experts)])
```

with my pretrained models

```
self.experts = nn.ModuleList([model1, model2, model3])
```
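For reference, this is a minimal sketch of what I think the swap implies. It assumes all three models take the same input size and produce the same output size. The `make_expert` helper and the sizes are hypothetical stand-ins for my real models, and the gating is dense (plain softmax) to keep it short. Freezing the experts is only one option I am considering, not something the paper prescribes.

```python
import torch
import torch.nn as nn


def make_expert(in_size: int = 8, out_size: int = 3) -> nn.Module:
    # Hypothetical stand-in for a pretrained model such as model1.
    return nn.Sequential(nn.Linear(in_size, 16), nn.ReLU(), nn.Linear(16, out_size))


# Stand-ins for [model1, model2, model3].
experts = nn.ModuleList([make_expert() for _ in range(3)])

# Optional: freeze the experts so that only the gate receives gradients.
for p in experts.parameters():
    p.requires_grad_(False)

# Gate weights, one column per expert (same shape convention as w_gate above).
w_gate = nn.Parameter(torch.zeros(8, len(experts)))

x = torch.randn(4, 8)                                      # [batch, input_size]
gates = torch.softmax(x @ w_gate, dim=1)                   # [batch, num_experts]
expert_out = torch.stack([e(x) for e in experts], dim=1)   # [batch, num_experts, output_size]
y = (gates.unsqueeze(-1) * expert_out).sum(dim=1)          # gate-weighted sum of expert outputs

# Back-propagate a dummy loss: only w_gate should get a gradient.
y.sum().backward()
```

Note that with three experts the layer would have to be built with `num_experts=3` and `k` of at most 3, because the constructor asserts `k <= num_experts` and the default is `k=4`.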

But I don’t know if this is enough. I know my question is somewhat vague and complicated, but at this point I am lost and frustrated, and any kind of information would be really helpful.