Tying weights between two Linear layers

John_Watkins · August 22, 2020, 7:29pm

I have a quick question about weight sharing/tying. Suppose I have two Linear modules in an encoder-decoder framework:
layer_e = torch.nn.Linear(20, 50)
layer_d = torch.nn.Linear(50, 20)
I want the weights of the two modules to be tied. How would I go about doing this? Specifically, the weights of layer_e and layer_d must be tied during both initialization and backpropagation, so that after training the entire framework, the weight of layer_d is still the transpose of the weight of layer_e.

Previous posts suggesting solutions to this problem seem to have some flaws. For example:

layer_d.weight = layer_e.weight.T
This does not work, because every variant of transpose (.T, .t(), .transpose(0, 1)) returns a plain Tensor rather than a Parameter, so the assignment raises an error.
layer_d.weight = torch.nn.parameter.Parameter(layer_e.weight.T)
This creates an entirely new Parameter for layer_d. Although its initial value matches layer_e.weight, it is not tied in backpropagation, so layer_d.weight and layer_e.weight will be different after training.
layer_d = torch.nn.functional.linear(input, layer_e.weight.T)
This replaces layer_d entirely. It may work if layer_e.weight.T returns a view of the original weights. However, it turns layer_d from a Module into a function call, which is really inconvenient given the existing codebase.
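
The first failure mode above can be reproduced with a small sketch. It assumes the attribute name weight (the name nn.Linear actually registers) and only checks what kind of object the transpose returns and whether the assignment is accepted; the variable names are illustrative.

```python
import torch
import torch.nn as nn

layer_e = nn.Linear(20, 50)   # weight shape: (50, 20)
layer_d = nn.Linear(50, 20)   # weight shape: (20, 50)

# Transposing a Parameter yields a plain Tensor that is not a leaf.
w_t = layer_e.weight.T
is_param = isinstance(w_t, nn.Parameter)
is_leaf = w_t.is_leaf

# nn.Module refuses to assign a plain Tensor to a registered parameter name.
try:
    layer_d.weight = w_t
    assign_error = None
except TypeError as exc:
    assign_error = exc

print(is_param, is_leaf, type(assign_error).__name__)
```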

Any help is appreciated.

ayalaa2 · August 22, 2020, 7:41pm

This works for me: layer_e.weight[:] = layer_d.weight.T[:].

John_Watkins · August 22, 2020, 9:13pm

Thank you for your response.

However, although the assignment works, passing the parameters to the optimizer's constructor raises the following error:

ValueError: can't optimize a non-leaf Tensor

What is the reason behind this, and how can I resolve the error?
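
As a hedged illustration of one way this error can arise (not necessarily the exact cause in the setup above): an optimizer only accepts leaf tensors, and any tensor produced by an operation on a Parameter, such as a transpose, is a non-leaf. The names w and w_t below are illustrative.

```python
import torch
import torch.nn as nn

w = nn.Parameter(torch.randn(50, 20))  # created directly: a leaf tensor
w_t = w.t()                            # result of an op on w: a non-leaf tensor

# Handing the non-leaf tensor to an optimizer is rejected.
try:
    torch.optim.SGD([w_t], lr=0.1)
    opt_error = None
except ValueError as exc:
    opt_error = exc

# Handing over the leaf Parameter itself is accepted.
optimizer = torch.optim.SGD([w], lr=0.1)
```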

ayalaa2 · August 25, 2020, 7:13pm

I would honestly just write your own module for this. You can initialize a tensor that is to be shared and inside of its forward method, use those weights in a transposed and non-transposed way. Perhaps have it as a flag that’s passed in through the forward method. I think you’ll get long-term weirdness when trying to strap two linear layers together.
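
A minimal sketch of that suggestion, under some assumptions: the class name TiedLinear, the decode flag, the separate biases bias_e and bias_d, and the Xavier initialization are all illustrative choices, not something specified in this thread. A single leaf Parameter is used directly for the encoder direction and transposed for the decoder direction, so there is only one weight to optimize.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TiedLinear(nn.Module):
    """Hypothetical module: one weight matrix shared by both directions."""

    def __init__(self, in_features, hidden_features):
        super().__init__()
        # Single shared leaf Parameter, shaped like nn.Linear(in, hidden).weight.
        self.weight = nn.Parameter(torch.empty(hidden_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        # The biases have different sizes, so each direction keeps its own.
        self.bias_e = nn.Parameter(torch.zeros(hidden_features))
        self.bias_d = nn.Parameter(torch.zeros(in_features))

    def forward(self, x, decode=False):
        if decode:
            # F.linear computes x @ W.T + b, so passing weight.t() gives x @ weight.
            return F.linear(x, self.weight.t(), self.bias_d)
        return F.linear(x, self.weight, self.bias_e)


tied = TiedLinear(20, 50)
optimizer = torch.optim.SGD(tied.parameters(), lr=0.1)

x = torch.randn(8, 20)
h = tied(x)                     # encoder direction: (8, 20) -> (8, 50)
x_hat = tied(h, decode=True)    # decoder direction: (8, 50) -> (8, 20)
loss = F.mse_loss(x_hat, x)
loss.backward()                 # gradients from both uses accumulate in tied.weight.grad
optimizer.step()
```

Because the transpose happens inside forward, the optimizer only ever sees the leaf Parameter, and the two directions cannot drift apart during training.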