Training submodules of dynamic model architecture

Hello everyone, I am facing the following situation: Suppose we have the following neural networks:

class A(torch.nn.Module):
    def __init__(self):
        super(A, self).__init__()
        self.input = nn.Linear(INPUT_A_SZ, NEURONS_NUM)
        self.linear1 = nn.Linear(NEURONS_NUM, NEURONS_NUM)
        self.linear2 = nn.Linear(NEURONS_NUM, NEURONS_NUM)
        self.linear3 = nn.Linear(NEURONS_NUM, NEURONS_NUM)
        self.linear4 = nn.Linear(NEURONS_NUM, NEURONS_NUM)
        self.linear5 = nn.Linear(NEURONS_NUM, NEURONS_NUM)
        self.out = nn.Linear(NEURONS_NUM, DATA_VECTOR_SIZE + 1)

    def forward(self, x):
        x = F.relu(self.input(x))
        x = F.relu(self.linear1(x))
        x = F.relu(self.linear2(x))
        x = F.relu(self.linear3(x))
        x = F.relu(self.linear4(x))
        x = F.relu(self.linear5(x))
        return F.relu(self.out(x))

class B - > similar architecture with A
class C - > similar architecture with A

In the above networks, only the input size changes. My goal is to assemble these small neural networks into a bigger one, let’s call it Big_Net, where the Big_Net has a tree structured architecture. In the forward pass of Big_Net, for each intermediate node, the input for this neural network is the output of his ancestor, concatenated with the corresponding input tensor and this procedure continues until the root of the tree is reached. The final goal, is to train each one of the small neural networks of type A, B, C, etc., through the Big_Net. The difficulty here is that the architecture of Big_Net is not static, but changes all the time based on the input and there are many instances of the same subnetwork (instances of type A, B, C) in the tree. Moreover, for all the instances of the same network within the tree, I want to implement weight sharing. My first question is how to implement the weight sharing among all instances of the same network within the tree and my second question is, since the architecture of the Big_Net changes all the time, how can a define the SGD optimizer in order to train the model. Here is an example of the architecture
Screenshot from 2020-09-15 11-06-03

Thank you all in advance for your help!

Try to reuse the layers where the weight sharing happens (e.g. assign the same nn.Linear instance to the model instance). This way you don’t have to copy the layer parameters.

When defining the SGD optimizer, you can define a separate param group for each model type to apply different learning rates to them, etc. Check the docs for that. Since the instances share the parameters, be careful to not pass the same parameter to the optimizer multiple times (it gets modified each time in that case, in PyTorch 1.7 this will raise a warning). To avoid that, you can do something like that:

optimizer = optim.SGD([{'params': list(set(itertools.chain(instanceA.parameters(), instanceB.parameters(), ...)), 
                        'lr': ...},
                      ], ...)