How to apply shared embeddings

Dear all :slight_smile:

I’m working on a grammatical error correction (GEC) task based on neural machine translation (NMT). The only difference between GEC and NMT is the shared embedding.

NMT embedding:

SRC = Field(tokenize=tokenizer, init_token='<sos>', eos_token='<eos>', batch_first=True)
TRG = Field(tokenize=tokenizer, init_token='<sos>', eos_token='<eos>', batch_first=True)

train_data, valid_data = TabularDataset.splits(path='…/data/', train='train.csv',
                                               validation='valid.csv', format='csv',
                                               fields=[('src', SRC), ('trg', TRG)], skip_header=True)

My implementation of shared embeddings looks like this:

TRG = Field(tokenize=tokenizer, init_token='<sos>', eos_token='<eos>', batch_first=True)

train_data, valid_data = TabularDataset.splits(path='…/data/', train='train.csv',
                                               validation='valid.csv', format='csv',
                                               fields=[('src', TRG), ('trg', TRG)], skip_header=True)

But the results are not good. What is the optimal implementation of shared embeddings in PyTorch?

Kind regards,
Aiman Solyman

Hope to get a little support :slight_smile:

Hi Aiman,

Firstly, could you provide a link to the paper you're trying to implement the GEC model from (if that's the case), as I'm not familiar with GEC.
However, aren't your two Field objects in the NMT code block redundant? If so, the second implementation would be equal to the first one (assuming that 'train.csv' and 'valid.csv' have not changed).
As already mentioned, I'm not familiar with GEC models, but why don't you just share a single embedding layer between the encoder and the decoder?
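Something like this is what I have in mind (just a toy sketch with made-up sizes and module names, not your actual model): one nn.Embedding instance is passed to both sides, so encoder and decoder look tokens up in the same weight matrix.

import torch.nn as nn

# Minimal sketch: a single nn.Embedding shared by encoder and decoder.
# TinyEncoder / TinyDecoder and the sizes below are illustrative placeholders.
VOCAB_SIZE, D_MODEL = 1000, 512
shared_embedding = nn.Embedding(VOCAB_SIZE, D_MODEL)

class TinyEncoder(nn.Module):
    def __init__(self, embedding):
        super().__init__()
        self.embedding = embedding          # shared module, not a copy

    def forward(self, src):
        return self.embedding(src)

class TinyDecoder(nn.Module):
    def __init__(self, embedding):
        super().__init__()
        self.embedding = embedding          # same object as in the encoder

    def forward(self, trg):
        return self.embedding(trg)

encoder = TinyEncoder(shared_embedding)
decoder = TinyDecoder(shared_embedding)
assert encoder.embedding.weight is decoder.embedding.weight  # one weight matrix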

Regards, :slight_smile:
Unity05


Thank you sir for your kind feedback.

I’m using this example, and my implementation looks like this:

class InputEmbeddingAndPositionalEncodingLayer(nn.Module):

    def __init__(self, vocab_size, max_len, d_model, dropout):
        super(InputEmbeddingAndPositionalEncodingLayer, self).__init__()
        self.vocab_size = vocab_size
        self.max_len = max_len
        self.d_model = d_model
        self.dropout = nn.Dropout(p=dropout)
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_encoding = PositionalEncodingLayer(d_model=d_model, max_len=max_len)

    def forward(self, sequences):
        """
        :param Tensor[batch_size, seq_len] sequences
        :return Tensor[batch_size, seq_len, d_model]
        """
        token_embedded = self.token_embedding(sequences) # [batch_size, seq_len, d_model]
        position_encoded = self.position_encoding(sequences) # [batch_size, seq_len, d_model]
        return self.dropout(token_embedded) + position_encoded # [batch_size, seq_len, d_model]

shared_embeddings = InputEmbeddingAndPositionalEncodingLayer(
    vocab_size=len(TRG.vocab),
    max_len=MAX_LEN,
    d_model=D_MODEL,
    dropout=DROPOUT
)
transformer = Transformer(
    encoder=EncoderLayer(
        in_emb_pos_enc_layer=shared_embeddings,
        d_model=D_MODEL,
        n_heads=N_HEADS,
        hidden_size=HIDDEN_SIZE,
        dropout=DROPOUT,
        n_layers=N_LAYERS
    ),
    decoder=DecoderLayer(
        in_emb_pos_enc_layer=shared_embeddings,
        vocab_size=len(TRG.vocab),
        d_model=D_MODEL,
        n_heads=N_HEADS,
        hidden_size=HIDDEN_SIZE,
        dropout=DROPOUT,
        n_layers=N_LAYERS
    ),
    src_pad_index=TRG.vocab.stoi[TRG.pad_token],
    dest_pad_index=TRG.vocab.stoi[TRG.pad_token]
).to(DEVICE)

The results are still not as good as with the implementation without shared embeddings. Could you give me an example?

One more thing: I used a shared vocabulary, so no worries about the 'train.csv' and 'valid.csv' changes.
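For reference, the single vocabulary is built over the shared field roughly like this (a sketch; the min_freq value is just an example):

# Build one vocabulary from the shared field, so source and target use
# the same token-to-index mapping and a single embedding table size.
TRG.build_vocab(train_data, min_freq=2)   # min_freq is an assumed example value
print(len(TRG.vocab))                     # this size is used for both encoder and decoder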

The code you’ve sent looks fine to me.

The results are still not as good as with the implementation without shared embeddings. Could you give me an example?

You’ve got better results on your GEC task with the same implementation but separate embeddings? :thinking:

Yes sir, as in the example below:

SRC = Field(tokenize=tokenizer, init_token='<sos>', eos_token='<eos>', batch_first=True)
TRG = Field(tokenize=tokenizer, init_token='<sos>', eos_token='<eos>', batch_first=True)

But all the papers on GEC mention that shared embeddings are the only difference between GEC and NMT, since the source and target are in the same language. So I thought there was an issue with my implementation.

Well, the first thing that comes to my mind is that you're using the identical layer twice. Therefore there might be problems with the gradients, as I assume they are accumulated.
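To illustrate what I mean (a toy example, not your model): when the same embedding module is used in two places, both backward passes add their gradients into the one weight.grad tensor.

import torch
import torch.nn as nn

# Toy illustration: the same embedding used twice receives the sum of both
# gradient contributions in its single weight.grad tensor.
emb = nn.Embedding(10, 4)
x = torch.tensor([[1, 2, 3]])

loss = emb(x).sum() + emb(x).sum()   # "encoder" use + "decoder" use
loss.backward()
print(emb.weight.grad[1:4])          # rows 1-3 got the gradient twice (all 2s)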

Correct, could you help?

If this really causes the problem, my first approach would be to modify the respective gradients 'manually', i.e. take the mean of the two embedding gradient contributions by hooking the grad, and then update the embedding weights 'manually'.
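Roughly like this (an untested sketch using the names from your snippet; halving the accumulated gradient amounts to averaging the encoder and decoder contributions):

# Hypothetical sketch: halve the accumulated gradient of the shared embedding
# weight, which is the same as averaging the two contributions.
handle = shared_embeddings.token_embedding.weight.register_hook(lambda grad: grad * 0.5)

# ... train as usual; optimizer.step() now sees the averaged gradient ...
# handle.remove()   # remove the hook to go back to the plain summed gradient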