Best way to tie LSTM weights?

Suppose there are two different LSTMs/BiLSTMs and I want to tie their weights. What is the best way to do it? There does not seem to be a torch.nn.functional interface for this. If I simply assign the weights after instantiating the LSTMs, like
self.lstm2.weight_ih_l0 = self.lstm1.weight_ih_l0
and so on for the other parameters, it seems to work, but there are two issues. First, I get the warning “UserWarning: RNN module weights are not part of single contiguous chunk of memory.” More importantly, there seems to be a memory leak: GPU memory utilization keeps increasing and the process runs out of memory after a while.
What is the best way to tie weights of two different LSTMs?
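For reference, a minimal sketch of what I mean (the wrapper class name and sizes are just illustrative); this is the version that triggers the warning and the apparent leak:

import torch.nn as nn

class TwoTiedLSTMs(nn.Module):                 # hypothetical wrapper, for illustration only
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lstm1 = nn.LSTM(input_size, hidden_size)
        self.lstm2 = nn.LSTM(input_size, hidden_size)
        # Naive tying: point every parameter of lstm2 at the corresponding lstm1 tensor,
        # i.e. self.lstm2.weight_ih_l0 = self.lstm1.weight_ih_l0, and so on.
        for name, param in self.lstm1.named_parameters():
            setattr(self.lstm2, name, param)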


Contiguous-memory warnings are unavoidable for tied weights, unfortunately, due to cuDNN limitations. I don’t know what’s going on with the memory, though.


One solution is to call lstm1.weight_ih_l0.data.copy_(lstm2.weight_ih_l0.data) (and likewise for all the other weights) after each optimizer update.
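Something like this, assuming two identically configured nn.LSTM modules so that parameters() line up; the helper name is just a placeholder:

import torch

def copy_lstm_weights(dst_lstm, src_lstm):
    # In-place copy of every parameter of src_lstm into dst_lstm, outside autograd.
    with torch.no_grad():
        for dst, src in zip(dst_lstm.parameters(), src_lstm.parameters()):
            dst.copy_(src)

# After each optimizer update:
# optim.step()
# copy_lstm_weights(lstm1, lstm2)   # same effect as lstm1.weight_ih_l0.data.copy_(lstm2.weight_ih_l0.data), etc.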


If we do the copy after every optimizer update, what happens to the gradients? Or are you suggesting tying the weights as I described in the question and then doing the copy as well?

Oh I see the issue. Okay maybe like this:

...
loss.backward()
# Fold lstm2's gradients into lstm1's parameters, then drop them from lstm2.
for w1, w2 in zip(lstm1.parameters(), lstm2.parameters()):
    w1.grad.data.add_(w2.grad.data)
    w2.grad = None
optim.step()
# Mirror the updated lstm1 weights back into lstm2.
for w1, w2 in zip(lstm1.parameters(), lstm2.parameters()):
    w2.data.copy_(w1.data)

Might be slow though
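For context, here is a self-contained toy version of this pattern; the sizes, data, and loss are made up for illustration:

import torch
import torch.nn as nn

# Two independent LSTMs that we keep in sync manually (sizes are arbitrary).
lstm1 = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
lstm2 = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
lstm2.load_state_dict(lstm1.state_dict())        # start from identical weights

# Only lstm1's parameters are handed to the optimizer.
optim = torch.optim.SGD(lstm1.parameters(), lr=0.1)

x = torch.randn(4, 10, 8)                        # (batch, seq, features) dummy batch
out1, _ = lstm1(x)
out2, _ = lstm2(x)
loss = (out1 + out2).pow(2).mean()               # toy loss touching both LSTMs

optim.zero_grad()
loss.backward()

# Fold lstm2's gradients into lstm1's parameters, update, then mirror the weights back.
for w1, w2 in zip(lstm1.parameters(), lstm2.parameters()):
    if w2.grad is not None:
        w1.grad.add_(w2.grad)
    w2.grad = None
optim.step()
with torch.no_grad():
    for w1, w2 in zip(lstm1.parameters(), lstm2.parameters()):
        w2.copy_(w1)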


Aha! That should work. I will try it out and check the speed. Thanks!


How did it perform in terms of speed? Thanks!

Why does it make sense to sum up all the gradients of the same parameters with respect to the loss?

Since w1 and w2 are both copies of the same underlying parameter w, the total derivative sums the contribution from each use (multivariable chain rule):

w1 = w
w2 = w
l = f(w1, w2)
dl/dw = (dl/dw1)(dw1/dw) + (dl/dw2)(dw2/dw) = dl/dw1 + dl/dw2
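A quick numerical check of this identity with autograd (toy values):

import torch

w = torch.tensor(3.0, requires_grad=True)
w1 = w * 1.0          # first "copy" of the shared parameter
w2 = w * 1.0          # second "copy"
l = w1 * w2           # f(w1, w2) = w1 * w2, so l = w**2
l.backward()
print(w.grad)         # tensor(6.) == dl/dw1 + dl/dw2 = w2 + w1 = 3 + 3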