Use an existing/pre-trained model as a block of a new model

Suppose I have a model, modelA. Can I use it as a block/layer in a new model, ModelB, by directly assigning modelA as a layer in the __init__() of ModelB? What are the possible side effects? For example,

import torch.nn as nn

modelA = nn.Conv2d(20, 20, 5)

class ModelB(nn.Module):
    def __init__(self, modelA):
        super().__init__()
        self.conv = nn.Conv2d(1, 20, 5)
        self.convA = modelA

    def forward(self, x):
        ...

modelB = ModelB(modelA)

I am concerned because the ModelB defined above is not conventional:

  • modelA is initialized outside __init__() of ModelB

I did some tests and no errors were found. However, the examples I used are simple and I am not sure if I missed anything.

Is there any difference if modelA is initialized inside __init__() of ModelB?

class ModelB(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 20, 5)
        self.convA = nn.Conv2d(20, 20, 5)

Further, what if modelA is initialized outside and some training is done before it is used in modelB?
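
For example, something like this (a rough sketch; the optimizer and training loop are only placeholders):

import torch
import torch.nn as nn

# train modelA on its own first ...
modelA = nn.Conv2d(20, 20, 5)
optA = torch.optim.SGD(modelA.parameters(), lr=0.01)
for _ in range(10):
    optA.zero_grad()
    loss = modelA(torch.randn(1, 20, 24, 24)).mean()
    loss.backward()
    optA.step()

# ... and only then pass the (pre-trained) modelA into ModelB
modelB = ModelB(modelA)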

Your approach will work and it makes no difference where the submodules were initialized. Creating modelA outside of modelB allows you to use it separately (modelB holds a reference to modelA, so it will see the changes). This usage could break e.g. DDP, but I don’t see any issues if a) you are not using modelA outside of modelB or b) you are using a plain nn.Module without DDP etc.
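
To make the reference point concrete: assigning modelA to an attribute in ModelB.__init__ registers it as a submodule, so its parameters are returned by modelB.parameters() and show up in the state_dict. A quick sketch, reusing your ModelB definition from above:

import torch.nn as nn

modelA = nn.Conv2d(20, 20, 5)
modelB = ModelB(modelA)

# modelA's parameters are registered as part of modelB
print(list(modelB.state_dict().keys()))
# ['conv.weight', 'conv.bias', 'convA.weight', 'convA.bias']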

Thank you for your explanation. I don’t use DDP.

Just to double check the “reference” you mentioned: to my understanding, “reference” means modelA and modelB.convA are the same object. Hence, if I train modelA outside or tune its parameters, modelB.convA will change accordingly, and vice versa.

You also mentioned “using modelA outside of modelB”; may I ask what you mean by that? I thought modelA and modelB.convA are the same, so using modelA would be the same as using modelB.convA?

Yes, exactly. This can also be verified using your example:

import torch
import torch.nn as nn

modelA = nn.Conv2d(20, 20, 5)

class ModelB(nn.Module):
    def __init__(self, modelA):
        super().__init__()
        self.conv = nn.Conv2d(1, 20, 5)
        self.convA = modelA

    def forward(self, x):
        x = self.conv(x)
        x = self.convA(x)
        return x

modelB = ModelB(modelA)
optimizer = torch.optim.Adam(modelB.parameters(), lr=1.)  # intentionally large lr so the weight changes are clearly visible

for _ in range(3):
    optimizer.zero_grad()
    out = modelB(torch.randn(1, 1, 24, 24))
    loss = out.mean()
    loss.backward()
    optimizer.step()
    print("internal: {}".format(modelB.convA.weight.abs().sum()))
    print("external: {}".format(modelA.weight.abs().sum()))
    
# internal: 10000.4697265625
# external: 10000.4697265625
# internal: 12015.4755859375
# external: 12015.4755859375
# internal: 16073.640625
# external: 16073.640625

As you can see, the parameters of modelA are updated for both the “external” and the “internal” modelA, since both reference the same object.
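
The “same object” claim can also be checked directly via identity:

print(modelB.convA is modelA)                # True
print(modelB.convA.weight is modelA.weight)  # True, the parameters are shared, not copied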

By that I meant training modelA directly versus training it through modelB.

Yes, but the issue I’m seeing arises when more complicated workflows, such as DDP, are used.
I.e. if you wrap modelB in DistributedDataParallel, the initialization will send the initial state_dict (including the state of modelA) to all ranks to make sure the model is exactly the same everywhere. In each training step DDP will then synchronize the gradients for you, so that the optimizer.step() calls create the same (updated) model on each rank again.
If you were to train modelA directly on any rank, you could easily make the training diverge, which is why I wanted to add the warning. It could still work if you are careful, but I also see an easy way to diverge the training.
The same applies to quantization etc., as I’m not sure how the internals work there.
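
As a rough sketch of the wrapping (a single process with the gloo backend, only to illustrate; in a real run every rank would execute this with its own rank/world_size, and ModelB is the class from the example above):

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

modelA = nn.Conv2d(20, 20, 5)
modelB = ModelB(modelA)
ddp_model = DDP(modelB)  # broadcasts modelB's state (including modelA) to all ranks

# gradients computed through ddp_model are synchronized across ranks;
# updating modelA directly (outside of ddp_model) bypasses this synchronization
# and can make the replicas diverge
dist.destroy_process_group()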
