How to create a model with shared weights?

xiaozhun07 · February 8, 2017, 3:00pm

I want to create a model with shared weights. For example, given two inputs A and B, the first 3 NN layers share the same weights, and the next 2 NN layers are separate for A and B.

How do I create such a model so that it performs optimally?

apaszke · February 8, 2017, 5:31pm

EDIT: we do support sharing Parameters between modules, but it’s recommended to decompose your model into many pieces that don’t share parameters if possible.

We don’t support using the same Parameters in many modules. Just reuse the base for two inputs:

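class MyModel(nn.Module):
    def __init__(self):
        super().__init__()  # required before assigning submodules
        self.base = ...     # shared layers
        self.head_A = ...   # layers used only for input1
        self.head_B = ...   # layers used only for input2

    def forward(self, input1, input2):
        # The same base module (and therefore the same weights) processes both inputs.
        return self.head_A(self.base(input1)), self.head_B(self.base(input2))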
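
For a concrete picture, here is a minimal runnable sketch of the pattern above, filled in to match the original question (3 shared layers, then 2 layers per head). The class name `TwoHeadModel`, the use of `nn.Linear` layers, and all layer sizes are assumptions chosen for illustration; they are not from the thread.

```python
import torch
import torch.nn as nn


class TwoHeadModel(nn.Module):
    """Shared 3-layer base followed by a separate 2-layer head per input."""

    def __init__(self, in_dim=8, hidden=16, out_dim=4):  # sizes are illustrative
        super().__init__()
        # Three layers whose weights are used for both inputs.
        self.base = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Two layers per head; these weights are not shared.
        self.head_A = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, out_dim)
        )
        self.head_B = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, out_dim)
        )

    def forward(self, input1, input2):
        return self.head_A(self.base(input1)), self.head_B(self.base(input2))


model = TwoHeadModel()
out_a, out_b = model(torch.randn(5, 8), torch.randn(5, 8))

# The base is registered once, so its parameters are counted once.
n_params = sum(p.numel() for p in model.parameters())
```

Because `self.base` is a single module, an optimizer built from `model.parameters()` sees one copy of the base weights and one copy of each head.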

vladimir · April 24, 2017, 6:05am

In your example, what will happen to the gradients of self.base? Will they be calculated taking into account both input1 and input2?

apaszke · April 24, 2017, 8:22am

Yes, you can use the same module multiple times during forward.
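
A small check can make this concrete. The sketch below (a single `nn.Linear` standing in for the base, with made-up sizes) compares the gradient obtained when both inputs pass through the same module against the sum of the gradients obtained from each input on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
base = nn.Linear(3, 2)  # stand-in for the shared base
x1, x2 = torch.randn(4, 3), torch.randn(4, 3)

# Gradient when both inputs go through the same module in one forward pass.
base.zero_grad()
(base(x1).sum() + base(x2).sum()).backward()
grad_both = base.weight.grad.clone()

# Gradient from input 1 alone.
base.zero_grad()
base(x1).sum().backward()
grad_1 = base.weight.grad.clone()

# Gradient from input 2 alone.
base.zero_grad()
base(x2).sum().backward()
grad_2 = base.weight.grad.clone()

# The shared module's gradient is the sum of both contributions.
same = torch.allclose(grad_both, grad_1 + grad_2)
```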

jekbradbury · April 27, 2017, 7:47pm

There are lots of cases where you can’t just reuse a Module but you still want to share parameters (e.g. in language modeling you sometimes want your word embeddings and output linear layers to share weight matrices). I thought reusing Parameters was ok? It’s used in the PTBLM example https://github.com/pytorch/examples/blob/master/word_language_model/model.py and it’s something people will keep doing (and expect to work) unless it throws an error.
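
As an illustration of that embedding/output tying, here is a minimal sketch. The class name `TiedLM` and the sizes are assumptions; the model is deliberately reduced to the two tied layers and is not the linked example itself.

```python
import torch
import torch.nn as nn


class TiedLM(nn.Module):
    def __init__(self, vocab_size=10, emb_dim=6):  # sizes are illustrative
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, emb_dim)
        self.decoder = nn.Linear(emb_dim, vocab_size)
        # Both weights have shape (vocab_size, emb_dim), so the same
        # Parameter object can be assigned to both layers.
        self.decoder.weight = self.encoder.weight

    def forward(self, tokens):
        return self.decoder(self.encoder(tokens))


model = TiedLM()
logits = model(torch.randint(0, 10, (2, 5)))

# parameters() yields the tied weight once, plus the decoder bias.
n_tensors = len(list(model.parameters()))
```

Tying this way requires the embedding dimension and the decoder's input dimension to match, since a single matrix has to serve both roles.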

apaszke · April 27, 2017, 8:46pm

Yeah they are supported, sorry for this. But it’s still considered better practice to not do it. I’ve updated the answer.

11189 · May 8, 2018, 11:44pm

https://github.com/seokinj/coPassGAN/blob/master/models.py#L27

            nn.ReLU(True),
            nn.Conv1d(DIM, DIM, 5, padding=2),#nn.Linear(DIM, DIM),
            nn.ReLU(True),
            nn.Conv1d(DIM, DIM, 5, padding=2),#nn.Linear(DIM, DIM),
        )

    def forward(self, input):
        output = self.res_block(input)
        return input + (0.3*output)


G_block_share1 = ResBlock()
G_block_share2 = ResBlock()

D_block_share4 = ResBlock()
D_block_share5 = ResBlock()


class Generator_A(nn.Module):

    def __init__(self, charmap):
        super(Generator_A, self).__init__()

In this code, the author creates the shared modules (G_block_share / D_block_share) outside of any class and then uses them in two different classes (Generator A and B, or Discriminator A and B).

Is this the right way to share weights between two generators/discriminators?
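
The pattern being asked about can be checked in isolation. The sketch below is a simplified stand-in, not the linked repository's code: `ResBlock` is shortened, `DIM` is an arbitrary value, and the `GeneratorA` / `GeneratorB` classes are hypothetical. It only shows that a module instance created once and assigned into two classes yields the very same Parameter objects in both.

```python
import torch
import torch.nn as nn

DIM = 4  # hypothetical channel count


class ResBlock(nn.Module):  # shortened stand-in for the block in the excerpt
    def __init__(self):
        super().__init__()
        self.res_block = nn.Sequential(
            nn.ReLU(),
            nn.Conv1d(DIM, DIM, 5, padding=2),
        )

    def forward(self, input):
        return input + 0.3 * self.res_block(input)


shared_block = ResBlock()  # created once, outside any class


class GeneratorA(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = shared_block  # same instance as in GeneratorB
        self.own = ResBlock()       # private to this generator

    def forward(self, x):
        return self.own(self.shared(x))


class GeneratorB(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = shared_block
        self.own = ResBlock()

    def forward(self, x):
        return self.own(self.shared(x))


gen_a, gen_b = GeneratorA(), GeneratorB()
out = gen_a(torch.randn(2, DIM, 7))

shared_is_same = gen_a.shared.res_block[1].weight is gen_b.shared.res_block[1].weight
own_is_same = gen_a.own.res_block[1].weight is gen_b.own.res_block[1].weight
```

One thing to keep in mind with this layout: if each generator gets its own optimizer, the shared parameters appear in both parameter lists and are updated by both.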

N_Hunter · December 6, 2018, 2:55am

Could you please tell us why it is better to not do it? Thanks

ZHANGHeng19931123 · April 14, 2019, 9:33pm

Dear Apaszke, thank you for your updates! But I am still a little confused about your answer. Like in your example, you have 3 modules (base, headA, headB), but how could you decompose them into pieces that don’t share parameters? Looking forward to your answer, please! Thank you for your attention.

Baichuan · February 27, 2020, 9:54am

I think that is wrong. Simply defining G_block_share = ResBlock() is the right way.
https://pytorch.org/tutorials/beginner/examples_nn/dynamic_net.html#pytorch-control-flow-weight-sharing

Baichuan · February 27, 2020, 9:55am

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()  # required before assigning submodules
        self.base1 = ...
        self.base2 = ...
        self.head_A = ...
        self.head_B = ...

    def forward(self, input1, input2):
        return self.head_A(self.base1(input1)), self.head_B(self.base2(input2))

Hazel · February 12, 2021, 11:26am

But in this case, how would base1 and base2 share the same weights? It seems like base1 + head_A and base2 + head_B are totally separate models.
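
The difference is easy to verify by counting parameters. In the sketch below, the class name `TwoBases`, the `share_base` flag, and the layer sizes are all hypothetical: with two independently constructed bases nothing is shared, whereas pointing `base2` at the same instance as `base1` makes the two paths use one set of weights.

```python
import torch
import torch.nn as nn


class TwoBases(nn.Module):
    def __init__(self, share_base):
        super().__init__()
        self.base1 = nn.Linear(4, 4)
        # Either reuse the same instance (shared weights) or build a new one.
        self.base2 = self.base1 if share_base else nn.Linear(4, 4)
        self.head_A = nn.Linear(4, 2)
        self.head_B = nn.Linear(4, 2)

    def forward(self, input1, input2):
        return self.head_A(self.base1(input1)), self.head_B(self.base2(input2))


def count_params(model):
    # parameters() yields each distinct Parameter once.
    return sum(p.numel() for p in model.parameters())


separate = TwoBases(share_base=False)
shared = TwoBases(share_base=True)

n_separate = count_params(separate)  # two bases + two heads
n_shared = count_params(shared)      # one base + two heads
out_a, out_b = shared(torch.randn(3, 4), torch.randn(3, 4))
```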

dariodematties · May 20, 2021, 8:50pm

Hi @apaszke, regarding this mechanism for sharing weights, what is the standard way of masking? In a batch of inputs, not all members have the same number of elements. For instance, in a batch of 16 sequences, some may have 10 elements while others have only 2. You can pad the shorter ones with 0s to match the maximum sequence length, but you do not want the weights to be adjusted based on those padding inputs. So what is the standard technique in PyTorch for masking with this kind of weight sharing?

Thanks!
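
One common approach (not necessarily the standard one, and not confirmed anywhere in this thread) is to compute the loss per time step and multiply it by a 0/1 mask before reducing, so padded positions contribute nothing to the gradient. A minimal sketch under that assumption, with a single `nn.Linear` applied at every time step and made-up shapes and lengths:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(3, 1)  # the same weights are applied at every time step

# Batch of 2 sequences padded to length 4; the real lengths are 4 and 2.
lengths = torch.tensor([4, 2])
x = torch.randn(2, 4, 3)
target = torch.randn(2, 4, 1)

# mask[b, t, 0] is 1.0 for real steps and 0.0 for padding.
mask = (torch.arange(4)[None, :] < lengths[:, None]).unsqueeze(-1).float()


def masked_grad(inputs):
    layer.zero_grad()
    per_step = (layer(inputs) - target) ** 2      # squared error per time step
    loss = (per_step * mask).sum() / mask.sum()   # average over real steps only
    loss.backward()
    return layer.weight.grad.clone()


grad_a = masked_grad(x)

# Changing only the padded positions should leave the weight gradient unchanged.
x_other_padding = x.clone()
x_other_padding[1, 2:] = 100.0
grad_b = masked_grad(x_other_padding)

padding_ignored = torch.allclose(grad_a, grad_b)
```

For recurrent layers, `torch.nn.utils.rnn.pack_padded_sequence` is another option, since it avoids running the layer on padded steps at all.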

Alex_Li · June 18, 2022, 5:35pm

I think sometimes you have to use weight sharing, like in the case where you want one layer to be the transpose of another. For this case, one can do this:

shared_w = torch.rand((n_y, n_z)) * .2 - .1  # initialize somehow
self.yzdecoding = nn.Linear(n_y, n_z)  # create the layers that will share weights
self.zydecoding = nn.Linear(n_z, n_y)
self.yzdecoding.weight = nn.Parameter(shared_w.T)  # share weights
self.zydecoding.weight = nn.Parameter(shared_w)

Note that nn.Linear(n_y, n_z) stores its weight with shape (n_z, n_y), which may not be intuitive at first.
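
An alternative sketch of the same idea keeps a single `nn.Parameter` and applies it, or its transpose, through `torch.nn.functional.linear`, so that only one parameter object exists and one gradient is accumulated. The class name `TiedTranspose`, the method names, the separate bias terms, and the sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TiedTranspose(nn.Module):
    def __init__(self, n_y=5, n_z=3):  # sizes are illustrative
        super().__init__()
        # One Parameter of shape (n_y, n_z), initialized uniformly in [-0.1, 0.1).
        self.shared_w = nn.Parameter(torch.rand(n_y, n_z) * 0.2 - 0.1)
        self.bias_z = nn.Parameter(torch.zeros(n_z))
        self.bias_y = nn.Parameter(torch.zeros(n_y))

    def y_to_z(self, y):
        # F.linear expects weight of shape (out_features, in_features),
        # here (n_z, n_y), which is the transpose of shared_w.
        return F.linear(y, self.shared_w.t(), self.bias_z)

    def z_to_y(self, z):
        # Here the expected shape is (n_y, n_z), i.e. shared_w itself.
        return F.linear(z, self.shared_w, self.bias_y)


model = TiedTranspose()
y = torch.randn(2, 5)
z = model.y_to_z(y)
y_back = model.z_to_y(z)

# Both directions contribute to the gradient of the single shared matrix.
y_back.sum().backward()
```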

yogeshluthra · September 22, 2023, 3:52pm

Thanks for the question. I have a similar use case. Did you find an answer?