How to create and train a tied autoencoder?

Chen_HY · May 2, 2017, 5:55pm

Some papers mention a tied autoencoder, in which the decoder weight matrix is the transpose of the encoder weight matrix, i.e. W_{decode} = W_{encode}.t().
How can I create such a network, where two layers share one matrix but use it differently?

smth · May 3, 2017, 3:13am

You can use the functional interface (import torch.nn.functional as F) and pass W and W.t() to it: out_encoder = F.linear(x, W) and out_decoder = F.linear(y, W.t()).

Hope these hints help.
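A minimal sketch of this suggestion, assuming a single-layer encoder and decoder without biases. The class name TiedAutoEncoder, the layer sizes, and the initialization scale are hypothetical and chosen only for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TiedAutoEncoder(nn.Module):
    """Hypothetical tied autoencoder: one matrix W used in both directions."""

    def __init__(self, in_features, hidden_features):
        super().__init__()
        # F.linear(x, W) computes x @ W.T, so W has shape (hidden, in).
        self.W = nn.Parameter(torch.randn(hidden_features, in_features) * 0.01)

    def forward(self, x):
        encoded = F.linear(x, self.W)            # (N, in) -> (N, hidden)
        decoded = F.linear(encoded, self.W.t())  # (N, hidden) -> (N, in)
        return encoded, decoded


model = TiedAutoEncoder(in_features=8, hidden_features=3)
x = torch.randn(4, 8)
encoded, decoded = model(x)
```

The model registers only one parameter, W, and both calls read from it, so there is nothing to keep in sync by hand.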

Chen_HY · May 3, 2017, 11:44am

Thank you very much. I will try that.
BTW, what is the difference between Linear as a Function and Linear as a Module?

trypag · May 3, 2017, 2:18pm

The Module version takes care of its weights by itself, whereas the functional version just applies the linear transform using an externally supplied weight matrix. It is based on the concept of immutability, where you compose functions with immutable state.
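To make the difference concrete, here is a small sketch (layer sizes are arbitrary) that applies the same weights once through the module and once through the functional call:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 5)

# Module: creates, registers, and owns its weight and bias.
layer = nn.Linear(5, 3)
out_module = layer(x)

# Functional: stateless; the caller supplies the weight (and optional bias).
out_functional = F.linear(x, layer.weight, layer.bias)

same = torch.allclose(out_module, out_functional)
param_names = [name for name, _ in layer.named_parameters()]
```

Both calls compute the same transform; the only difference is who holds the parameters.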

jekbradbury · May 3, 2017, 6:48pm

If you want to, you can also have two modules share a weight matrix just by setting mod1.weight = mod2.weight, but the functional approach is likely to be less magical and harder to make a mistake with.
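A sketch of the module-sharing variant. It assumes both layers have the same shape (4 -> 4, chosen arbitrarily), because direct assignment shares the matrix as-is and not its transpose, so it would not fit an encoder/decoder pair with different input and hidden sizes:

```python
import torch
import torch.nn as nn

# Two layers of identical shape, so the weight can be shared as-is.
mod1 = nn.Linear(4, 4, bias=False)
mod2 = nn.Linear(4, 4, bias=False)
mod1.weight = mod2.weight  # both attributes now refer to the same nn.Parameter

net = nn.Sequential(mod1, mod2)
net(torch.randn(3, 4)).sum().backward()

shared = mod1.weight is mod2.weight
n_params = len(list(net.parameters()))  # a shared parameter is listed once
```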

Chen_HY · May 4, 2017, 10:28am

My final choice, which is a little complicated:

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class MirrorLinear(nn.Module):
    def __init__(self, in_features, out_features, weight, bias=True):
        super(MirrorLinear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        # weight: the nn.Parameter of the mirrored layer,
        # with shape (in_features, out_features)
        self.register_parameter('weight', weight)
        if bias:
            self.bias = nn.Parameter(torch.Tensor(out_features))
        else:
            self.register_parameter('bias', None)
        self.reset_bias()

    def reset_bias(self):
        stdv = 1. / math.sqrt(self.weight.size(1))
        if self.bias is not None:
            self.bias.data.uniform_(-stdv, stdv)

    def forward(self, input):
        # Apply the shared weight transposed; F.linear accepts bias=None
        return F.linear(input, self.weight.t(), self.bias)

    def __repr__(self):
        return self.__class__.__name__ + ' (' \
            + str(self.in_features) + ' -> ' \
            + str(self.out_features) + ')'

Filipe_Silva · August 7, 2018, 4:55am

@smth, your solution assumes we are defining a custom neural network module where W is a parameter of the network, right?

Is there any way to tie weights by only instantiating off-the-shelf nn modules without fiddling around with new parameters?

InnovArul · August 7, 2018, 5:14am

This would be a more appropriate answer to your question.

Filipe_Silva · August 7, 2018, 5:46am

@InnovArul, I have already read that post, but I would like to know if there is another way to achieve that result.

InnovArul · August 7, 2018, 1:33pm

Hi, sorry for the delay. I am not sure if this is what you are looking for, but please take a look at this:


https://gist.github.com/InnovArul/500e0c57e88300651f8005f9bd0d12bc

tied_linear.py

import torch, torch.nn as nn, torch.nn.functional as F
import numpy as np
import torch.optim as optim

# tied autoencoder using off the shelf nn modules
class TiedAutoEncoderOffTheShelf(nn.Module):
	def __init__(self, inp, out, weight):
		super().__init__()
		self.encoder = nn.Linear(inp, out, bias=False)
		self.decoder = nn.Linear(out, inp, bias=False)

This file has been truncated.

I tried both approaches: two linear layers that share their weights, and F.linear.

Hope it helps.

Filipe_Silva · August 7, 2018, 9:41pm

Hi @InnovArul,

Thank you for your answer, it was quite enlightening!

Yet, what are your thoughts about this approach:

class MixedApproachTiedAutoEncoder(nn.Module):
	def __init__(self, inp, out, weight):
		super().__init__()
		self.encoder = nn.Linear(inp, out, bias=False)

	def forward(self, input):
		encoded_feats = self.encoder(input)
		reconstructed_output = F.linear(encoded_feats, self.encoder.weight.t())
		return encoded_feats, reconstructed_output

I believe that building the network this way will result in the decoder's parameters not being differentiated during the backward pass, because the functional call has no state. At the same time, since we are performing an operation on a parameter, I would expect the network graph to be updated dynamically.

What is wrong with this train of thought?

Cheers.

InnovArul · August 8, 2018, 12:23am

To me, the mixed approach looks better.

For the record, I have updated the gist to verify that it works too.


https://gist.github.com/InnovArul/500e0c57e88300651f8005f9bd0d12bc


Filipe_Silva · August 8, 2018, 1:28am

@InnovArul, can you explain how the weights for the decoder part are being differentiated?

As far as I know, defining a module stores the values of its learnable parameters (weights), which does not happen when we simply call the functional counterpart. In the mixed approach we define only one module/layer in the __init__ function and then pass the output of that layer to the function in F. Therefore, I was expecting that the weights between the hidden layer and the output layer would not be automatically differentiated, right?

Thank you for keeping the gist updated

InnovArul · August 8, 2018, 6:56am

In my understanding, the gradient still flows to the encoder’s weight because we pass the weight itself (self.encoder.weight.t()) to the function in F, i.e. the link to the parameter is still intact, so automatic differentiation happens (thanks to the integration of Variable into Tensor, I guess). If you instead pass the raw data in encoder.weight (using self.encoder.weight.data.t()), you detach the differentiation path, so the decoder's use of the weight no longer contributes gradients to encoder.weight.
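A small sketch of that distinction, using .detach() in place of .data. The tensor h is a hypothetical stand-in for the hidden activations, and the sizes are arbitrary:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(6, 2, bias=False)
h = torch.randn(5, 2)  # stand-in for hidden activations (no grad needed)

# Detached weight: the decoder output is cut off from the parameter.
out_detached = F.linear(h, encoder.weight.detach().t())

# The parameter itself: the decoder output stays connected to it.
out_tied = F.linear(h, encoder.weight.t())
out_tied.sum().backward()

detached_in_graph = out_detached.requires_grad
tied_in_graph = out_tied.requires_grad
```

Only the second call records the operation in the autograd graph, so only it contributes to encoder.weight.grad.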

Maybe @tom, @smth, or @ptrblck would be able to give a more appropriate answer on the technicalities of this. For now, this is my understanding. I am not sure if it is 100% correct.

tom · August 8, 2018, 8:55am

As Arul says, given an nn.Parameter or, more generally, any leaf node, a (differentiable) calculation that uses it multiple times will propagate (and accumulate) gradients on backward. And this is precisely what you would want for tied weights.
A minor note: Don’t use .data; it is strictly worse than .detach().

Best regards

Thomas
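The accumulation can be checked numerically. The sketch below is an illustration, not code from the gist: it compares the gradient of one tied leaf against the sum of the gradients of two untied copies holding the same values. The helper loss_fn, the sizes, and the tolerance are assumptions made for the example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
W = nn.Parameter(torch.randn(3, 8))
x = torch.randn(4, 8)


def loss_fn(w_enc, w_dec):
    # Encode with w_enc, decode with the transpose of w_dec.
    h = F.linear(x, w_enc)
    recon = F.linear(h, w_dec.t())
    return F.mse_loss(recon, x)


# Tied: the same leaf is used twice in one graph.
(g_tied,) = torch.autograd.grad(loss_fn(W, W), W)

# Untied copies with equal values: one gradient per use.
W_enc = W.detach().clone().requires_grad_()
W_dec = W.detach().clone().requires_grad_()
g_enc, g_dec = torch.autograd.grad(loss_fn(W_enc, W_dec), (W_enc, W_dec))

accumulates = torch.allclose(g_tied, g_enc + g_dec, atol=1e-5)
```

The tied gradient matches the sum of the encoder-path and decoder-path gradients, which is the behaviour described above.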

Bion_Howard · December 30, 2018, 3:28am

@tom @InnovArul

this is a fascinating topic because if we can make scalable tied autoencoders (“coders”) then we can get twice the number of optimization steps on 1/2 the number of parameters – potentially 4x improvement …

How do you make sure the forward step of a tied autoencoder computes gradients properly? Do you add a reverse function?

Di_Wu · May 17, 2020, 10:18pm

Arul,

Thanks for your code example; it is truly helpful. I just have one question: should we set requires_grad=False for the decoder layer after setting self.decoder.weight.data = self.encoder.weight.data.transpose(0,1)?

Thanks a lot.

InnovArul · May 23, 2020, 4:32am

You do not need to set requires_grad=False for the decoder.
We expect the tied weight to receive gradients from both the encoder and the decoder.
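A sketch of what this looks like with the assignment quoted in the question. The layer sizes, the squared-error loss, and the plain SGD step are assumptions for illustration only; the full gist may differ:

```python
import torch
import torch.nn as nn

encoder = nn.Linear(8, 3, bias=False)
decoder = nn.Linear(3, 8, bias=False)
# Assignment quoted in the question: the decoder weight becomes a
# transposed view of the encoder weight's storage.
decoder.weight.data = encoder.weight.data.transpose(0, 1)

x = torch.randn(4, 8)
loss = ((decoder(encoder(x)) - x) ** 2).mean()
loss.backward()

shares_storage = decoder.weight.data_ptr() == encoder.weight.data_ptr()
both_have_grads = (encoder.weight.grad is not None
                   and decoder.weight.grad is not None)

# Each parameter applies its own gradient to the same underlying storage.
optimizer = torch.optim.SGD([encoder.weight, decoder.weight], lr=0.1)
optimizer.step()
still_tied = torch.equal(decoder.weight, encoder.weight.t())
```

Both parameters keep requires_grad=True, so the shared storage is updated with the encoder's gradient and the decoder's gradient in the same step.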