# How to create and train a tied autoencoder?

Some papers mention a tied autoencoder, in which the two weight matrices are transposes of each other, i.e. W_{decode} = W_{encode}^T.
How can I create such a network, where two layers share one matrix but use it differently?

You can use the functional interface (`import torch.nn.functional as F`) and pass `W` and `W.t()` explicitly: `out_encoder = F.linear(x, W)` and `out_decoder = F.linear(y, W.t())`, where `y` is the decoder's input (for example `out_encoder`).

Hope these hints help.
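
As a rough illustration of the functional approach described above, here is a minimal sketch. The class name `FunctionalTiedAutoEncoder`, the layer sizes, and the Xavier initialization are assumptions made for the example, not something prescribed in this thread.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FunctionalTiedAutoEncoder(nn.Module):
    """Tied autoencoder with a single weight matrix W (hypothetical name)."""

    def __init__(self, in_features, hidden_features):
        super().__init__()
        # W has shape (hidden, in), the same layout nn.Linear uses.
        self.W = nn.Parameter(torch.empty(hidden_features, in_features))
        nn.init.xavier_uniform_(self.W)  # assumed initialization

    def forward(self, x):
        # F.linear(x, W) computes x @ W.T, mapping in -> hidden.
        encoded = F.linear(x, self.W)
        # Passing W.t() computes encoded @ W, mapping hidden -> in.
        decoded = F.linear(encoded, self.W.t())
        return encoded, decoded


torch.manual_seed(0)
model = FunctionalTiedAutoEncoder(in_features=8, hidden_features=3)
x = torch.randn(5, 8)
encoded, decoded = model(x)
```

Because `W` is registered once, the module exposes a single parameter even though it is used in both directions.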

Thank you very much. I will try that.
BTW, what is the difference between Linear as a Function and Linear as a Module?

The Module version creates and manages its weights itself, whereas the functional version only applies the linear transform to a weight matrix that you pass in. The functional form is stateless: you compose functions and supply the state (the weights) explicitly.
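
To make the difference concrete, the following sketch (with arbitrary sizes chosen for illustration) applies the same weights once through the module and once through the function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 6)

# Module: creates, initializes and registers its own weight and bias.
layer = nn.Linear(6, 2)
out_module = layer(x)

# Function: stateless; the caller supplies the weight and bias.
out_functional = F.linear(x, layer.weight, layer.bias)

same = torch.allclose(out_module, out_functional)
param_names = [name for name, _ in layer.named_parameters()]
```

Both calls compute the same result; only the ownership of the weights differs.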

If you want to, you can also have two modules share a weight matrix just by setting `mod1.weight = mod2.weight`, but the functional approach is likely to be less magical and less error-prone.
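
A small sketch of the module-sharing variant follows. Note that this shares the matrix as-is, without a transpose, so both layers need the same weight shape; the square 4x4 layers here are an assumption made purely for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mod1 = nn.Linear(4, 4, bias=False)
mod2 = nn.Linear(4, 4, bias=False)

# Both modules now hold the very same Parameter object.
mod1.weight = mod2.weight

x = torch.randn(3, 4)
loss = mod2(mod1(x)).sum()
loss.backward()

shared = mod1.weight is mod2.weight
# parameters() de-duplicates shared parameters, so only one is reported.
n_unique = len(list(nn.Sequential(mod1, mod2).parameters()))
```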

My final choice is slightly more complicated:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class MirrorLinear(nn.Module):
    """Linear layer that applies the transpose of a shared weight."""

    def __init__(self, in_features, out_features, weight, bias=True):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # `weight` is the shared nn.Parameter, shape (in_features, out_features).
        self.register_parameter('weight', weight)
        if bias:
            self.bias = nn.Parameter(torch.empty(out_features))
        else:
            self.register_parameter('bias', None)
        self.reset_bias()

    def reset_bias(self):
        stdv = 1. / math.sqrt(self.weight.size(1))
        if self.bias is not None:
            nn.init.uniform_(self.bias, -stdv, stdv)

    def forward(self, input):
        # F.linear stands in for the legacy self._backend.Linear() call;
        # it accepts bias=None, so no branching is needed.
        return F.linear(input, self.weight.t(), self.bias)

    def __repr__(self):
        return '{} ({} -> {})'.format(
            self.__class__.__name__, self.in_features, self.out_features)
```

@smth, your solution assumes we are defining a custom neural network module where W is a parameter of the network, right?

Is there any way to tie weights by only instantiating off-the-shelf nn modules without fiddling around with new parameters?

@InnovArul, I have read that post already, yet, I would like to know if there is another way to achieve that result.

Hi, sorry for the delay. I am not sure if this is what you are looking for, but please take a look at this:

I have tried using two linear layers that share their weights, as well as using `F.linear`.

Hope it helps.

Hi @InnovArul,

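```python
import torch.nn as nn
import torch.nn.functional as F


class MixedApproachTiedAutoEncoder(nn.Module):
    def __init__(self, inp, out, weight):
        super().__init__()
        # `weight` is unused here; the encoder owns the only weight matrix.
        self.encoder = nn.Linear(inp, out, bias=False)

    def forward(self, input):
        encoded_feats = self.encoder(input)
        # Decode with the transpose of the encoder's weight (tied weights).
        reconstructed_output = F.linear(encoded_feats, self.encoder.weight.t())
        return encoded_feats, reconstructed_output
```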

I believe that building the network this way will leave the decoder's parameters undifferentiated in the backward pass, because the functional call has no state. At the same time, since we are performing an operation on a parameter, I would expect the network graph to be updated dynamically.

What is wrong with this train of thought?

Cheers.

To me, the mixed approach looks better.

For the record, I have updated the gist to verify that it works too.

@InnovArul, can you explain how the weights for the decoder part are being differentiated?

As far as I know, defining a module stores the values of its learnable parameters (weights), whereas simply calling the functional counterpart does not. In the mixed approach we define only one module/layer in `__init__` and then pass that layer's output to a function from `F`. I was therefore expecting that the weights between the hidden layer and the output layer would not be differentiated automatically, right?

Thank you for keeping the gist updated.

In my understanding, the gradient still flows to the encoder’s weight because we pass the weight itself (`self.encoder.weight.t()`) to the function in `F`. The link to the parameter stays intact, so automatic differentiation happens (thanks to the integration of Variable into Tensor, I guess). If you instead pass the raw data in `encoder.weight` (using `self.encoder.weight.data.t()`), you detach that differentiation path, so the decoder's use of the weight no longer contributes gradients to `encoder.weight`.

Maybe @tom, @smth, or @ptrblck could give a more precise answer on the technicalities. For now, this is my understanding; I am not sure it is 100% correct.
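
One way to check this understanding is to compare the gradient on `encoder.weight` with and without the decoder path attached. The sketch below is only an illustration: the helper name `weight_grad`, the sizes, and the MSE loss are assumptions, and it uses `.detach()` rather than `.data` to cut the path.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
encoder = nn.Linear(6, 3, bias=False)
x = torch.randn(10, 6)


def weight_grad(decoder_weight):
    """Return d(loss)/d(encoder.weight) for a given decoder weight."""
    encoder.zero_grad()
    reconstructed = F.linear(encoder(x), decoder_weight)
    F.mse_loss(reconstructed, x).backward()
    return encoder.weight.grad.clone()


# Tied: the transpose is still part of the autograd graph.
grad_tied = weight_grad(encoder.weight.t())
# Detached: same values, but the decoder no longer contributes a gradient.
grad_detached = weight_grad(encoder.weight.detach().t())

differs = not torch.allclose(grad_tied, grad_detached)
```

The two gradients differ because, in the tied case, the decoder's use of the weight adds its own contribution.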

As Arul says, given an `nn.Parameter`, or more generally any leaf node, a (differentiable) calculation that uses it multiple times will propagate (and accumulate) gradients on backward. This is precisely what you want for tied weights.
A minor note: don’t use `.data`; it is strictly not as good as `.detach()`.

Best regards

Thomas
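
The accumulation described above can be seen on a tiny example. This sketch uses made-up values and a leaf tensor `w` that appears twice in the same computation.

```python
import torch

# A leaf tensor used twice in the same differentiable computation.
w = torch.tensor([2.0, -1.0], requires_grad=True)
a = torch.tensor([3.0, 4.0])
b = torch.tensor([10.0, 20.0])

# The gradient of the first use is a, and of the second use is b.
y = (w * a).sum() + (w * b).sum()
y.backward()

# backward() adds both contributions into w.grad, giving a + b.
print(w.grad)  # tensor([13., 24.])
```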

This is a fascinating topic: if we can make scalable tied autoencoders (“coders”), we get twice the number of optimization steps on half the number of parameters, potentially a 4x improvement …

How do you make sure the forward step of a tied autoencoder handles gradients properly? Do you need to add a reverse function?

Arul,

Thanks for your code example, it is truly helpful. I have just one question: should we set `requires_grad=False` for the decoder layer after setting `self.decoder.weight.data = self.encoder.weight.data.transpose(0,1)`?

Thanks a lot.

You do not need to set `requires_grad=False` for the decoder.
We expect the tied weight to receive gradients from both the encoder and the decoder.
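
For illustration, here is a minimal training sketch in which the tied weight keeps `requires_grad=True`. It swaps in the single-parameter formulation (`F.linear` with the transposed encoder weight) rather than the two-layer `.data` tie from the question, and the synthetic data, SGD optimizer, learning rate, and step count are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
encoder = nn.Linear(8, 4, bias=False)  # its weight is the only parameter
optimizer = torch.optim.SGD(encoder.parameters(), lr=0.1)
x = torch.randn(64, 8)  # synthetic data, for illustration only

losses = []
for _ in range(20):
    optimizer.zero_grad()
    encoded = encoder(x)
    # Decoder reuses the encoder weight, transposed.
    reconstructed = F.linear(encoded, encoder.weight.t())
    loss = F.mse_loss(reconstructed, x)
    loss.backward()  # encoder and decoder uses both contribute
    optimizer.step()
    losses.append(loss.item())
```

Each `backward()` call sums the encoder-side and decoder-side contributions into `encoder.weight.grad` before the optimizer step.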