How to apply shared embeddings

Dear all :slight_smile:

I’m working on a grammatical error correction (GEC) task based on neural machine translation (NMT). The only difference between GEC and NMT is the shared embedding.

NMT embedding:

SRC = Field(tokenize=tokenizer, init_token='<sos>', eos_token='<eos>', batch_first=True)
TRG = Field(tokenize=tokenizer, init_token='<sos>', eos_token='<eos>', batch_first=True)

train_data, valid_data = TabularDataset.splits(path='…/data/', train='train.csv',
                                               validation='valid.csv', format='csv',
                                               fields=[('src', SRC), ('trg', TRG)], skip_header=True)

My implementation of shared embeddings looks like this:

TRG = Field(tokenize=tokenizer, init_token='<sos>', eos_token='<eos>', batch_first=True)

train_data, valid_data = TabularDataset.splits(path='…/data/', train='train.csv',
                                               validation='valid.csv', format='csv',
                                               fields=[('src', TRG), ('trg', TRG)], skip_header=True)

But the results are not good. What is the optimal implementation of shared embeddings in PyTorch?

Kind regards,
Aiman Solyman

Hope to get a little support :slight_smile:

Hi Aiman,

Firstly, could you provide a link to the paper you're trying to implement the GEC model from (if that's the case), as I'm not familiar with GEC.
However, aren't your two Field objects in the NMT code block redundant? If so, the second implementation would be equal to the first one (assuming that 'train.csv' and 'valid.csv' have not changed).
As already mentioned, I'm not familiar with GEC models, but why don't you just share a single embedding layer between the encoder and the decoder?
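Something like this is what I have in mind (just a toy sketch with made-up sizes and module names, not your actual model): one nn.Embedding instance is passed to both sides, so encoder and decoder look tokens up in the same weight matrix.

import torch.nn as nn

# Minimal sketch: a single nn.Embedding shared by encoder and decoder.
# TinyEncoder / TinyDecoder and the sizes below are illustrative placeholders.
VOCAB_SIZE, D_MODEL = 1000, 512
shared_embedding = nn.Embedding(VOCAB_SIZE, D_MODEL)

class TinyEncoder(nn.Module):
    def __init__(self, embedding):
        super().__init__()
        self.embedding = embedding          # shared module, not a copy

    def forward(self, src):
        return self.embedding(src)

class TinyDecoder(nn.Module):
    def __init__(self, embedding):
        super().__init__()
        self.embedding = embedding          # same object as in the encoder

    def forward(self, trg):
        return self.embedding(trg)

encoder = TinyEncoder(shared_embedding)
decoder = TinyDecoder(shared_embedding)
assert encoder.embedding.weight is decoder.embedding.weight  # one weight matrix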

Regards, :slight_smile:
Unity05


Thank you sir for your kind feedback.

I’m using this example, and my implementation looks like this:

class InputEmbeddingAndPositionalEncodingLayer(nn.Module):

    def __init__(self, vocab_size, max_len, d_model, dropout):
        super(InputEmbeddingAndPositionalEncodingLayer, self).__init__()
        self.vocab_size = vocab_size
        self.max_len = max_len
        self.d_model = d_model
        self.dropout = nn.Dropout(p=dropout)
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_encoding = PositionalEncodingLayer(d_model=d_model, max_len=max_len)

    def forward(self, sequences):
        """
        :param Tensor[batch_size, seq_len] sequences
        :return Tensor[batch_size, seq_len, d_model]
        """
        token_embedded = self.token_embedding(sequences) # [batch_size, seq_len, d_model]
        position_encoded = self.position_encoding(sequences) # [batch_size, seq_len, d_model]
        return self.dropout(token_embedded) + position_encoded # [batch_size, seq_len, d_model]

shared_embeddings = InputEmbeddingAndPositionalEncodingLayer(
    vocab_size=len(TRG.vocab),
    max_len=MAX_LEN,
    d_model=D_MODEL,
    dropout=DROPOUT
)
transformer = Transformer(
    encoder=EncoderLayer(
        in_emb_pos_enc_layer=shared_embeddings,
        d_model=D_MODEL,
        n_heads=N_HEADS,
        hidden_size=HIDDEN_SIZE,
        dropout=DROPOUT,
        n_layers=N_LAYERS
    ),
    decoder=DecoderLayer(
        in_emb_pos_enc_layer=shared_embeddings,
        vocab_size=len(TRG.vocab),
        d_model=D_MODEL,
        n_heads=N_HEADS,
        hidden_size=HIDDEN_SIZE,
        dropout=DROPOUT,
        n_layers=N_LAYERS
    ),
    src_pad_index=TRG.vocab.stoi[TRG.pad_token],
    dest_pad_index=TRG.vocab.stoi[TRG.pad_token]
).to(DEVICE)

The results are still not as good as with the implementation without shared embeddings. Could you give me an example?

One more thing: I used a shared vocabulary, so no worries about the 'train.csv' and 'valid.csv' changes.
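For reference, the single vocabulary is built over the shared field roughly like this (a sketch; the min_freq value is just an example):

# Build one vocabulary from the shared field, so source and target use
# the same token-to-index mapping and a single embedding table size.
TRG.build_vocab(train_data, min_freq=2)   # min_freq is an assumed example value
print(len(TRG.vocab))                     # this size is used for both encoder and decoder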

The code you’ve sent looks fine to me.

The results are still not as good as with the implementation without shared embeddings. Could you give me an example?

You’ve got better results on your GEC task with the same implementation but separate embeddings? :thinking:

Yes sir, as in the example below:

SRC = Field(tokenize=tokenizer, init_token='<sos>', eos_token='<eos>', batch_first=True)
TRG = Field(tokenize=tokenizer, init_token='<sos>', eos_token='<eos>', batch_first=True)

But all the papers on GEC mention that shared embeddings are the only difference between GEC and NMT, since the source and target are in the same language. So I thought there was an issue with my implementation.

Well, the first thing that comes to my mind is that you're using the identical layer twice. Therefore there might be problems with the gradients, as I assume they are accumulated.
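To illustrate what I mean (a toy example, not your model): when the same embedding module is used in two places, both backward passes add their gradients into the one weight.grad tensor.

import torch
import torch.nn as nn

# Toy illustration: the same embedding used twice receives the sum of both
# gradient contributions in its single weight.grad tensor.
emb = nn.Embedding(10, 4)
x = torch.tensor([[1, 2, 3]])

loss = emb(x).sum() + emb(x).sum()   # "encoder" use + "decoder" use
loss.backward()
print(emb.weight.grad[1:4])          # rows 1-3 got the gradient twice (all 2s)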

Correct, could you help?

If this really causes the problem, my first approach would be to modify the respective gradients 'manually', i.e. take the mean of the two embedding gradient contributions by hooking the grad, and then update the embedding weights 'manually'.
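Roughly like this (an untested sketch using the names from your snippet; halving the accumulated gradient amounts to averaging the encoder and decoder contributions):

# Hypothetical sketch: halve the accumulated gradient of the shared embedding
# weight, which is the same as averaging the two contributions.
handle = shared_embeddings.token_embedding.weight.register_hook(lambda grad: grad * 0.5)

# ... train as usual; optimizer.step() now sees the averaged gradient ...
# handle.remove()   # remove the hook to go back to the plain summed gradient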