Best way to tie LSTM weights?

Suppose there are two different LSTMs/BiLSTMs and I want to tie their weights. What is the best way to do it? There does not seem to be a torch.nn.functional interface for this. If I simply assign the weights after instantiating the LSTMs, like
self.lstm2.weight_ih_l0 = self.lstm1.weight_ih_l0
and so on for the other parameters, it seems to work, but there are two issues. First, I get the warning “UserWarning: RNN module weights are not part of single contiguous chunk of memory.” More importantly, there seems to be a memory leak: GPU memory utilization keeps increasing and the process runs out of memory after a while.
What is the best way to tie weights of two different LSTMs?
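For reference, a minimal sketch of what I mean (the wrapper class name and sizes are just illustrative); this is the version that triggers the warning and the apparent leak:

import torch.nn as nn

class TwoTiedLSTMs(nn.Module):                 # hypothetical wrapper, for illustration only
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lstm1 = nn.LSTM(input_size, hidden_size)
        self.lstm2 = nn.LSTM(input_size, hidden_size)
        # Naive tying: point every parameter of lstm2 at the corresponding lstm1 tensor,
        # i.e. self.lstm2.weight_ih_l0 = self.lstm1.weight_ih_l0, and so on.
        for name, param in self.lstm1.named_parameters():
            setattr(self.lstm2, name, param)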


Contiguous-memory warnings are unavoidable for tied weights, unfortunately, due to cuDNN limitations. I don’t know what’s going on with the memory, though.


One solution is to call lstm1.weight_ih_l0.data.copy_(lstm2.weight_ih_l0.data) (and likewise for all the other weights) after each optimizer update.
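Something like this, assuming two identically configured nn.LSTM modules so that parameters() line up; the helper name is just a placeholder:

import torch

def copy_lstm_weights(dst_lstm, src_lstm):
    # In-place copy of every parameter of src_lstm into dst_lstm, outside autograd.
    with torch.no_grad():
        for dst, src in zip(dst_lstm.parameters(), src_lstm.parameters()):
            dst.copy_(src)

# After each optimizer update:
# optim.step()
# copy_lstm_weights(lstm1, lstm2)   # same effect as lstm1.weight_ih_l0.data.copy_(lstm2.weight_ih_l0.data), etc.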


If we do the copy after every optimizer update, what happens to the gradients? Or are you suggesting tying the weights as I described in the question and then doing the copy as well?

Oh I see the issue. Okay maybe like this:

...
loss.backward()
# Fold lstm2's gradients into lstm1's parameters, then drop them from lstm2.
for w1, w2 in zip(lstm1.parameters(), lstm2.parameters()):
    w1.grad.data.add_(w2.grad.data)
    w2.grad = None
optim.step()
# Mirror the updated lstm1 weights back into lstm2.
for w1, w2 in zip(lstm1.parameters(), lstm2.parameters()):
    w2.data.copy_(w1.data)

Might be slow though
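For context, here is a self-contained toy version of this pattern; the sizes, data, and loss are made up for illustration:

import torch
import torch.nn as nn

# Two independent LSTMs that we keep in sync manually (sizes are arbitrary).
lstm1 = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
lstm2 = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
lstm2.load_state_dict(lstm1.state_dict())        # start from identical weights

# Only lstm1's parameters are handed to the optimizer.
optim = torch.optim.SGD(lstm1.parameters(), lr=0.1)

x = torch.randn(4, 10, 8)                        # (batch, seq, features) dummy batch
out1, _ = lstm1(x)
out2, _ = lstm2(x)
loss = (out1 + out2).pow(2).mean()               # toy loss touching both LSTMs

optim.zero_grad()
loss.backward()

# Fold lstm2's gradients into lstm1's parameters, update, then mirror the weights back.
for w1, w2 in zip(lstm1.parameters(), lstm2.parameters()):
    if w2.grad is not None:
        w1.grad.add_(w2.grad)
    w2.grad = None
optim.step()
with torch.no_grad():
    for w1, w2 in zip(lstm1.parameters(), lstm2.parameters()):
        w2.copy_(w1)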


Aha! That should work. I will try it out and check the speed. Thanks!


How did it perform in terms of speed? Thanks!

Why does it make sense to sum up all the gradients of the same parameters with respect to the loss?

Since w1 and w2 are both copies of the same underlying parameter w, the total derivative sums the contribution from each use (multivariable chain rule):

w1 = w
w2 = w
l = f(w1, w2)
dl/dw = (dl/dw1)(dw1/dw) + (dl/dw2)(dw2/dw) = dl/dw1 + dl/dw2
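A quick numerical check of this identity with autograd (toy values):

import torch

w = torch.tensor(3.0, requires_grad=True)
w1 = w * 1.0          # first "copy" of the shared parameter
w2 = w * 1.0          # second "copy"
l = w1 * w2           # f(w1, w2) = w1 * w2, so l = w**2
l.backward()
print(w.grad)         # tensor(6.) == dl/dw1 + dl/dw2 = w2 + w1 = 3 + 3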