How to tie embeddings?

Hi,

I have seen people using tied embeddings in machine translation,

  1. but since the source and target languages have different vocabularies, how can the source and target embeddings be tied?

  2. How do I implement tied embeddings in PyTorch? Just using the same embedding would work, but what if I have defined my encoder and decoder as separate classes?

Thanks.

Hm, I use the same embedding for autoencoder networks, but there the vocabularies are obviously the same. I have no idea how this would work for different languages/vocabularies.

If your setup allows you to use the same embedding, implementing it is easy: you just define the embedding layer first and then pass it to the encoder and decoder classes (e.g., in the constructor). You can have a look at how I do it for an RNN and a CNN autoencoder. The important snippet is:

import torch.nn as nn

class TextCnnAE:

    def __init__(self, device, params, criterion):
        self.params = params
        self.device = device
        self.vocab_size = params.vocab_size
        self.embed_dim = params.embed_dim

        # Embedding layer, shared by encoder and decoder
        self.embedding = nn.Embedding(self.vocab_size, self.embed_dim, max_norm=1, norm_type=2)

        # Calculate the 2-tuples for the kernel sizes (the last one depends on the max_seq_len)
        max_seq_len, kernel_sizes = self.calc_last_seq_len()
        self.params.kernel_sizes[-1] = max_seq_len

        # Create the encoder and decoder modules, both receiving the shared embedding
        self.encoder = Encoder(device, self.embedding, params, max_seq_len, kernel_sizes)
        self.decoder = Decoder(device, self.embedding, params, max_seq_len, kernel_sizes)

        ...
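The Encoder and Decoder classes then simply store and use the embedding they are given. A minimal sketch of what that can look like, illustrated with an RNN for brevity (the class internals here are made up for illustration, not the exact code from my repo):

import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, embedding, embed_dim, hidden_dim):
        super().__init__()
        # Keep a reference to the embedding passed in; do NOT create a new
        # nn.Embedding here, otherwise encoder and decoder stop sharing weights.
        self.embedding = embedding
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, inputs):
        embedded = self.embedding(inputs)          # (batch, seq_len, embed_dim)
        _, hidden = self.rnn(embedded)
        return hidden

class Decoder(nn.Module):
    def __init__(self, embedding, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embedding = embedding                 # same object as in the encoder
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, inputs, hidden):
        embedded = self.embedding(inputs)
        output, hidden = self.rnn(embedded, hidden)
        return self.out(output), hidden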

Sorry, I’ve made an error. The tying of weights does not refer to simply sharing the embedding layer between encoder and decoder, although that makes sense as well, of course. I’ve looked into it and extended my RNN-based and CNN-based autoencoders to allow for tied weights (just search for self.params.tie_weights in the code).

I don’t think it’s a perfectly correct implementation, but it works and actually seems to improve training; so far I have only tested it with a smaller dataset.
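For completeness, this is roughly what the tying itself looks like in PyTorch: the decoder’s output projection reuses the embedding matrix as its weight. This is a generic sketch, not my exact implementation (names and layer sizes are made up; it projects the hidden state back to embed_dim first, since the tied weight must have shape (vocab_size, embed_dim)):

import torch.nn as nn

class TiedDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Map the hidden state back into embedding space ...
        self.hidden_to_embed = nn.Linear(hidden_dim, embed_dim)
        # ... and reuse the embedding matrix as the output projection.
        self.out = nn.Linear(embed_dim, vocab_size, bias=False)
        self.out.weight = self.embedding.weight    # weight tying: one shared Parameter

    def forward(self, inputs, hidden=None):
        embedded = self.embedding(inputs)
        output, hidden = self.rnn(embedded, hidden)
        logits = self.out(self.hidden_to_embed(output))
        return logits, hidden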