Customizing the Mixture of Experts layer

I am trying to implement a mixture of experts layer, similar to the one described in:

and already discussed in this thread. While reading other threads on the topic, I found the following sentence:

“The MoE (Mixture of Experts Layer) is trained using back-propagation. The Gating Network outputs an (artificially made) sparse vector that acts as a chooser of which experts to consult. More than one expert can be consulted at once.”
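To make the quoted sentence concrete, here is a minimal sketch of a top-k gate. It assumes plain (noise-free) top-k selection followed by a softmax over the selected logits; `top_k_gates` and the tensor sizes are made up for illustration and are not taken from the code discussed here.

```python
import torch

def top_k_gates(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Return a (batch, num_experts) gate matrix with k nonzero entries per row."""
    top_vals, top_idx = logits.topk(k, dim=1)      # the k highest-scoring experts per example
    top_gates = torch.softmax(top_vals, dim=1)     # normalise over the selected experts only
    gates = torch.zeros_like(logits)               # every other expert keeps weight 0
    return gates.scatter(1, top_idx, top_gates)

torch.manual_seed(0)
x = torch.randn(5, 8)                              # 5 examples, 8 features (illustrative)
w_gate = torch.randn(8, 4, requires_grad=True)     # gate weights for 4 hypothetical experts
gates = top_k_gates(x @ w_gate, k=2)
print(gates)                                       # each row has two nonzero weights
```

Each row of `gates` selects two experts and weights them, so more than one expert is consulted per example, and the result stays differentiable with respect to `w_gate`.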

I am not sure whether the experts here are pre-trained, or whether training involves only the Gating Network or the full layer (the Gating Network together with the experts). If anybody is familiar with this model, please explain this to me if possible.

In any case, I have built 3 neural networks (model1, model2 and model3) that I have already trained and tuned, and I want to include them in the MoE layer to improve the overall accuracy.

The code has the following class:

import torch
import torch.nn as nn
from torch.distributions.normal import Normal

# MLP is the expert network class defined elsewhere in the same code.

class MoE(nn.Module):
    """Call a Sparsely gated mixture of experts layer with 1-layer Feed-Forward networks as experts.
    input_size: integer - size of the input
    output_size: integer - size of the output
    num_experts: an integer - number of experts
    hidden_size: an integer - hidden size of the experts
    noisy_gating: a boolean
    k: an integer - how many experts to use for each batch element
    """

    def __init__(self, input_size, output_size, num_experts, hidden_size, noisy_gating=True, k=4):
        super(MoE, self).__init__()
        self.noisy_gating = noisy_gating
        self.num_experts = num_experts
        self.output_size = output_size
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.k = k

        # instantiate experts
        self.experts = nn.ModuleList([MLP(self.input_size, self.output_size, self.hidden_size) for i in range(self.num_experts)])

        # gating and noise weights, one column per expert
        self.w_gate = nn.Parameter(torch.zeros(input_size, num_experts), requires_grad=True)
        self.w_noise = nn.Parameter(torch.zeros(input_size, num_experts), requires_grad=True)

        self.softplus = nn.Softplus()
        self.softmax = nn.Softmax(1)
        self.normal = Normal(torch.tensor([0.0]), torch.tensor([1.0]))

        # cannot select more experts per example than exist
        assert self.k <= self.num_experts

I replaced the line

    self.experts = nn.ModuleList([MLP(self.input_size, self.output_size, self.hidden_size) for i in range(self.num_experts)])

with my pretrained models

    self.experts = nn.ModuleList([model1, model2, model3])
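With this substitution, the rest of the constructor still depends on num_experts and k: w_gate and w_noise have one column per expert, and the assertion k <= num_experts fails with the default k=4 if only three models are used. A minimal sketch of a sanity check, assuming each pre-trained model maps a (batch, input_size) tensor to a (batch, output_size) tensor; `check_experts` is a hypothetical helper and the nn.Linear models are stand-ins for model1, model2 and model3.

```python
import torch
import torch.nn as nn

def check_experts(experts, input_size, output_size):
    """Hypothetical helper: verify each expert's output shape and return the expert count."""
    probe = torch.zeros(2, input_size)             # dummy batch of 2 examples
    with torch.no_grad():
        for i, expert in enumerate(experts):
            was_training = expert.training
            expert.eval()                          # avoid touching BatchNorm/Dropout state
            out = expert(probe)
            expert.train(was_training)             # restore the original mode
            if out.shape != (2, output_size):
                raise ValueError(f"expert {i} returned shape {tuple(out.shape)}")
    return len(experts)

# Stand-ins for the three pre-trained models (assumed sizes: 10 inputs, 3 outputs).
experts = [nn.Linear(10, 3) for _ in range(3)]
num_experts = check_experts(experts, input_size=10, output_size=3)
print(num_experts)
```

The returned count can then be passed as num_experts, together with a k of at most 3, when constructing the layer.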

But I don’t know if this is enough. I know my question is somewhat vague and complicated, but at this point I am lost and frustrated, so any kind of information would be really helpful.

I am not familiar with this work, but your code reflects what you want to do.
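As a quick check of what nn.ModuleList does here: it registers each model as a submodule, so their parameters show up in the parameters() of the enclosing module and an optimizer built from it would update them along with the gate. A minimal sketch, where `TinyMoE` is a hypothetical stripped-down module and the nn.Linear models stand in for your pre-trained ones:

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Hypothetical stand-in that keeps only the expert list and the gate weights."""
    def __init__(self, experts, input_size):
        super().__init__()
        self.experts = nn.ModuleList(experts)      # registers every expert as a submodule
        self.w_gate = nn.Parameter(torch.zeros(input_size, len(experts)))

# Stand-ins for model1, model2 and model3.
model1, model2, model3 = (nn.Linear(10, 3) for _ in range(3))
moe = TinyMoE([model1, model2, model3], input_size=10)

names = [name for name, _ in moe.named_parameters()]
expert_names = [n for n in names if n.startswith("experts.")]
print(names)                                       # gate weights plus each expert's weight and bias
```

Anything listed there is handed to the optimizer when you pass moe.parameters() to it.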

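Note that if you don’t want these pre-trained models to be updated, you could “hide” them from the optimizer by storing them in a plain Python list, self.experts = [model1, model2, model3] (a plain list also hides them from .to(), .cuda() and state_dict(), so you would have to move and save them yourself), or set requires_grad to False for all their parameters: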

for m in [model1, model2, model3]:
  for p in m.parameters():