nn.Transformer for NMT

I'm trying to implement the Transformer for NMT.

So far, my code looks like this:

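import torch.nn as nn

class transformer(nn.Module):
    def __init__(self, src_vocab_size, trg_vocab_size, emb_dim=512, m_dim=512, d_model=512):
        super().__init__()  # was super(transf, self), which raises a NameError
        # emb_dim, m_dim and d_model must all be equal for the shapes to line up
        self.transformer = nn.Transformer(d_model=d_model)
        # NOTE: one embedding is shared by src and trg, so this assumes a shared vocabulary
        self.Embedding = nn.Embedding(src_vocab_size, emb_dim)
        self.pos_encoder = PositionalEncoding(emb_dim)  # defined elsewhere (not shown)
        self.fc_out = nn.Linear(m_dim, trg_vocab_size)

    def forward(self, src, trg):
        # src, trg: (seq_len, batch) token ids, since nn.Transformer defaults to batch_first=False
        src = self.Embedding(src)
        src = self.pos_encoder(src)
        trg = self.Embedding(trg)
        trg = self.pos_encoder(trg)
        # causal mask of shape (trg_len, trg_len), placed on the same device as trg
        self.trg_mask = self.transformer.generate_square_subsequent_mask(trg.shape[0]).to(trg.device)
        out = self.transformer(src, trg, src_mask=None, tgt_mask=self.trg_mask)
        out = self.fc_out(out)  # (trg_len, batch, trg_vocab_size)
        return out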

the input is

But I'm getting a loss that is too small. I think it's because of the attention mask in self.trg_mask = self.transformer.generate_square_subsequent_mask(trg.shape[0]).to(device),
but I'm not really sure.
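
One possible cause of an implausibly small loss, offered as a guess because the training loop is not shown: the mask from generate_square_subsequent_mask lets position i attend to positions up to and including i. If the same unshifted trg is used both as decoder input and as the label, the decoder can simply copy the token it is given. The usual remedy is to feed trg without its last token and compare against trg without its first token. The sketch below shows that shift on a plain list; the helper name shift_for_teacher_forcing and the BOS/EOS ids are made up for illustration.

```python
# Hypothetical special-token ids; the real values depend on your vocabulary.
BOS, EOS = 1, 2


def shift_for_teacher_forcing(trg):
    """Split one target sequence into (decoder_input, labels).

    decoder_input drops the last token, labels drop the first one, so the
    label at step i is the token that comes right after the input at step i.
    """
    decoder_input = trg[:-1]
    labels = trg[1:]
    return decoder_input, labels


# One made-up target sentence as token ids.
trg = [BOS, 11, 12, 13, EOS]
decoder_input, labels = shift_for_teacher_forcing(trg)

# Print which token the decoder is fed and which one it must predict at each step.
for step, (fed, target) in enumerate(zip(decoder_input, labels)):
    print(f"step {step}: decoder is fed {fed}, must predict {target}")
```

With sequence-first tensors of shape (trg_len, batch), as in the model above, the same slices trg[:-1] and trg[1:] cut along the time dimension, so the call would look like model(src, trg[:-1]) with the loss computed against trg[1:].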

What do you guys think?


Edit: I'm thinking it’s because I'm creating just one mask and I need BATCH_SIZE masks. I'm trying with batch_size = 1.
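
On the question of one mask versus BATCH_SIZE masks: as far as the nn.Transformer documentation goes, tgt_mask may be a single 2D tensor of shape (trg_len, trg_len), and it is applied to every sequence in the batch. The sketch below is a small check of that, using a deliberately tiny model and random inputs (all sizes are arbitrary choices for illustration). It also verifies that changing the last target position leaves the outputs at earlier positions unchanged, which is what the causal mask is meant to guarantee.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny model so the check runs quickly; sizes are arbitrary, not recommendations.
d_model, src_len, trg_len, batch = 16, 7, 5, 3
model = nn.Transformer(d_model=d_model, nhead=2, num_encoder_layers=1,
                       num_decoder_layers=1, dim_feedforward=32)
model.eval()  # turn off dropout so two forward passes are comparable

# Already-embedded inputs, sequence-first: (seq_len, batch, d_model).
src = torch.randn(src_len, batch, d_model)
trg = torch.randn(trg_len, batch, d_model)

# A single 2D mask, independent of the batch size.
mask = model.generate_square_subsequent_mask(trg_len)

with torch.no_grad():
    out = model(src, trg, tgt_mask=mask)

    # Perturb only the last target position for every sequence in the batch.
    trg_changed = trg.clone()
    trg_changed[-1] += 1.0
    out_changed = model(src, trg_changed, tgt_mask=mask)

# Earlier positions must not be affected by a change at a later position.
earlier_unchanged = torch.allclose(out[:-1], out_changed[:-1], atol=1e-5)
print(mask.shape, out.shape, earlier_unchanged)
```

So the mask itself does not need to be replicated per batch element; a per-sequence mask is only needed for things like padding, which nn.Transformer handles through the separate key-padding-mask arguments.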

Some example here

Have you managed to implement the Transformer for NMT? Please share your experience. Thanks!

This example is not for NMT.

I tried to follow the hyperparameters in the "Attention Is All You Need" paper and could not reproduce the expected result.