How to create a model with shared weights?

I want to create a model with shared weights. For example: given two inputs A and B, the first 3 NN layers share the same weights, and the next 2 NN layers are separate for A and B respectively.

How can I create such a model, and have it perform optimally?


EDIT: we do support sharing Parameters between modules, but it’s recommended to decompose your model into many pieces that don’t share parameters if possible.

The recommended approach is to just reuse the base module for both inputs:

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.base = ...
        self.head_A = ...
        self.head_B = ...

    def forward(self, input1, input2):
        return self.head_A(self.base(input1)), self.head_B(self.base(input2))

In your example, what will happen to the gradients of self.base? Will they be calculated taking both input1 and input2 into account?

Yes. You can use the same module multiple times during forward; the gradients from both uses are accumulated into its parameters.
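A minimal runnable sketch of this (the layer sizes here are made up for illustration) shows that `self.base` receives gradient contributions from both inputs:

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.base = nn.Linear(4, 8)    # shared trunk
        self.head_A = nn.Linear(8, 2)  # head for input A
        self.head_B = nn.Linear(8, 2)  # head for input B

    def forward(self, input1, input2):
        # the same self.base is applied to both inputs
        return self.head_A(self.base(input1)), self.head_B(self.base(input2))

model = MyModel()
a, b = torch.randn(3, 4), torch.randn(3, 4)
out_a, out_b = model(a, b)
(out_a.sum() + out_b.sum()).backward()

# base.weight.grad now holds the summed gradients from both forward paths
print(model.base.weight.grad.shape)  # torch.Size([8, 4])
```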


There are lots of cases where you can’t just reuse a Module but you still want to share parameters (e.g. in language modeling you sometimes want your word embeddings and output linear layers to share weight matrices). I thought reusing Parameters was ok? It’s used in the PTBLM example and it’s something people will keep doing (and expect to work) unless it throws an error.
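For reference, the usual tying pattern in that language-modeling case is to assign the very same `Parameter` object to both modules (the sizes below are illustrative):

```python
import torch
import torch.nn as nn

vocab, dim = 10, 6
embedding = nn.Embedding(vocab, dim)  # weight has shape (vocab, dim)
decoder = nn.Linear(dim, vocab)       # weight also has shape (vocab, dim)
decoder.weight = embedding.weight     # tie: one Parameter, two modules

tokens = torch.tensor([1, 2, 3])
logits = decoder(embedding(tokens))
logits.sum().backward()

# a single shared Parameter accumulates gradients from both uses
print(decoder.weight is embedding.weight)  # True
```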


Yeah, they are supported, sorry about that. But it's still considered better practice not to do it. I've updated the answer.

In this code, the author creates the shared modules ('G_block_share'/'D_block_share') outside the classes, and then uses those shared modules in two different classes (Generator A & B, or Discriminator A & B)…

Is this code the right way to share weights between two generators/discriminators?
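A sketch of the pattern being described (the class and layer names here are hypothetical stand-ins for the linked code): one shared block is constructed once, and the same instance is passed to both generators, so its parameters are shared between them.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, shared_block):
        super().__init__()
        self.shared = shared_block      # the same instance in A and B
        self.private = nn.Linear(8, 8)  # per-generator layers

    def forward(self, x):
        return self.private(self.shared(x))

G_block_share = nn.Linear(8, 8)  # stand-in for the shared ResBlock
G_A = Generator(G_block_share)
G_B = Generator(G_block_share)

# both generators hold the very same Parameters for the shared block
print(G_A.shared.weight is G_B.shared.weight)  # True
```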

Could you please tell us why it is better not to do it? Thanks.


Dear @apaszke, thank you for the update! But I'm still a little confused by your answer. In your example there are 3 modules (base, head_A, head_B); how could you decompose them into pieces that don't share parameters? Looking forward to your answer. Thank you for your attention.


I think that is wrong. Just defining a single G_block_share = ResBlock() and reusing it is the right way.

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.base1 = ...
        self.base2 = ...
        self.head_A = ...
        self.head_B = ...

    def forward(self, input1, input2):
        return self.head_A(self.base1(input1)), self.head_B(self.base2(input2))

But in this case, how would base1 and base2 share the same weights? It seems like base1 + head_A and base2 + head_B are totally separate models.

Hi @apaszke, regarding this mechanism for sharing weights, what is the standard way of masking?
I mean, in a batch of inputs, not all members of the batch have the same number of elements:
for instance, in a batch of 16 sequences you may well have some sequences with 10 elements and others with only 2.
You can pad the shorter ones with 0s to match the maximum sequence length, but you do not want your weights to be adjusted based on such padding inputs.
So what is the standard technique in PyTorch for masking in this setting?
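There is no mechanism specific to weight sharing here; the standard approach is to mask the loss so that padded positions contribute zero gradient (for classification, `ignore_index` in `nn.CrossEntropyLoss` does the same job). A minimal sketch with made-up sizes:

```python
import torch
import torch.nn as nn

batch, max_len, dim = 2, 5, 4
x = torch.randn(batch, max_len, dim)
lengths = torch.tensor([5, 2])  # second sequence is padding after 2 steps

layer = nn.Linear(dim, 1)
out = layer(x).squeeze(-1)      # (batch, max_len)
target = torch.zeros(batch, max_len)

# mask[i, t] is True for real timesteps, False for padding
mask = torch.arange(max_len)[None, :] < lengths[:, None]

loss = ((out - target) ** 2)[mask].mean()  # padded steps never enter the loss,
loss.backward()                            # so they contribute no gradient
```

For RNNs specifically, `torch.nn.utils.rnn.pack_padded_sequence` achieves the same effect by skipping padded timesteps entirely.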


I think sometimes you have to use weight sharing, like in the case where you want one layer's weights to be the transpose of another's. For this case, one can do the following:

shared_w = torch.rand((n_y, n_z)) * .2 - .1  # initialize somehow
self.yzdecoding = nn.Linear(n_y, n_z)  # create the two layers
self.zydecoding = nn.Linear(n_z, n_y)
# nn.Parameter does not copy its data, so both parameters below are
# views of the same underlying storage
self.yzdecoding.weight = nn.Parameter(shared_w.T)  # shape (n_z, n_y)
self.zydecoding.weight = nn.Parameter(shared_w)    # shape (n_y, n_z)

Note that an (n_y, n_z) Linear layer stores its weight with shape (n_z, n_y), which may not be intuitive at first.
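An alternative that avoids keeping two separate `Parameter` objects is to store the weight once and apply its transpose functionally with `F.linear`. A sketch under that approach (class name and sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedCoder(nn.Module):
    def __init__(self, n_y, n_z):
        super().__init__()
        # a single (n_z, n_y) weight, used in both directions
        self.weight = nn.Parameter(torch.rand(n_z, n_y) * 0.2 - 0.1)

    def encode(self, y):  # y -> z
        return F.linear(y, self.weight)

    def decode(self, z):  # z -> y, reusing the transposed weight
        return F.linear(z, self.weight.t())

model = TiedCoder(n_y=4, n_z=3)
y = torch.randn(2, 4)
recon = model.decode(model.encode(y))
print(recon.shape)  # torch.Size([2, 4])
```

Because both directions read the same `Parameter`, the optimizer sees exactly one tensor to update, and its gradient is the sum of the contributions from both uses.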