Masked word prediction problem

Whatever input I give, it predicts the same value. For example:

Input:
[90, 91, 26, 62, 92, 93, 26, 94, 95, 96]
incumbering soil and washed into immediate and glittering popularity possibly
Masked Input:
[90, 91, 26, 62, 92, 93, 26, 1, 95, 96]
incumbering soil and washed into immediate and unnk popularity possibly
Output:
[90, 91, 26, 62, 92, 93, 26, 33, 95, 96]
incumbering soil and washed into immediate and the popularity possibly

As you can see, it always predicts the “the” token.
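
For reference, the outputs above are read out roughly like this (a simplified sketch; model and device are the ones defined further down in the training code):

import torch

# feed the masked ids through the model and take the argmax at every position
masked_ids = [90, 91, 26, 62, 92, 93, 26, 1, 95, 96]
x = torch.LongTensor(masked_ids).unsqueeze(1).to(device)   # (seq_len, N=1)

model.eval()
with torch.no_grad():
    logits = model(x)              # (seq_len, N, vocab_size)
pred_ids = logits.argmax(dim=-1).squeeze(1).tolist()
print(pred_ids)                    # the masked slot (index 7) always comes back as 33 ("the")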

Model:

class Kemal(nn.Module):
    def __init__(self, src_vocab_size, embedding_size, num_heads, dim_forward, num_encoder_layers, max_len, src_pad_idx, dropout, device):
        super(Kemal, self).__init__()
        
        # token and (learned) positional embeddings
        self.src_word_embedding = nn.Embedding(src_vocab_size, embedding_size)
        self.src_position_embedding = nn.Embedding(max_len, embedding_size)

        self.device = device

        self.encoder_norm = nn.LayerNorm(embedding_size)

        # encoder-only Transformer stack
        self.encoder_layer = nn.TransformerEncoderLayer(embedding_size, num_heads, dim_feedforward=dim_forward, dropout=dropout, activation='gelu')
        self.encoder = nn.TransformerEncoder(self.encoder_layer, num_encoder_layers, self.encoder_norm)

        # projection from hidden states to vocabulary logits
        self.fc = nn.Linear(embedding_size, src_vocab_size)

        self.src_pad_idx = src_pad_idx
        
    def make_src_pad_mask(self, src):
        # (seq_len, N) -> (N, seq_len), True where the token is padding
        src_mask = src.transpose(0, 1) == self.src_pad_idx
        return src_mask

    def forward(self, src):
        src_seq_length, N = src.shape

        # causal ("no peeking ahead") mask over the source sequence
        src_mask = nn.Transformer.generate_square_subsequent_mask(None, src_seq_length).to(self.device)

        src_positions = (
            torch.arange(0, src_seq_length).unsqueeze(1).to(self.device)
        )

        embed_src = self.src_word_embedding(src) + self.src_position_embedding(src_positions)
        src_padding_mask = self.make_src_pad_mask(src)

        out = self.encoder(embed_src, mask=src_mask, src_key_padding_mask=src_padding_mask)
        out = self.fc(out)  # (seq_len, N, vocab_size) logits

        return out

Thanks in advance :slight_smile:

I forgot to add the training loop, here it is:

# Hyperparameters
src_vocab_size = len(vocab)
d_model = 512
nhead = 8
dim_forward = 1024
num_layers = 12
max_len = 10
pad_idx = 0
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 
load_model = False
save_model = True
 
learning_rate = 1e-5
src_pad_idx = 0
 
model = Kemal(src_vocab_size, d_model, nhead, dim_forward, num_layers, max_len, pad_idx, 0.3, device).to(device)
 
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()
#scheduler = optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)
 
loss_list = []
 
model.train() 
for epoch in range(500):
    for sequence in range(int(len(encoded_text[0:30])/10)):
        x, y = get_data(encoded_text[0:30], sequence)
        x = masking(x)
        x = torch.LongTensor(x).to(device)
        y = torch.LongTensor(y).to(device)
        x = x.reshape(1, -1)

        out = model(x)
        out = out.view(-1, len(vocab))
 
        optimizer.zero_grad()
        loss = criterion(out, y)
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
            
        if sequence % 1000 == 0 and sequence != 0:
            print("Loss: ", loss, "S: ", sequence)
 
        optimizer.step()

    #scheduler.step()

    if epoch % 1 == 0:
        if len(loss_list) > 0:
            print(f'Epoch: {epoch}, Step: {sequence}, Loss: {sum(loss_list) / len(loss_list)}')
        loss_list.append(loss)

Your training loop looks fine to me.

  1. Are you referring to an open-source codebase for this?
  2. Can you overfit your model to a small dataset first and ensure the model is training perfectly before generalizing? You can also set the random seeds to ensure consistent masking (see the sketch below).
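
For example, something along these lines (a rough sketch that reuses the names from your snippets: get_data, masking, encoded_text, model, criterion, optimizer, device, vocab):

import random
import torch

# fix the seeds so the masked position is the same on every run
# (set them before building the model if you also want the same initialization)
random.seed(0)
torch.manual_seed(0)

# take one tiny sample and train on it repeatedly
x, y = get_data(encoded_text[0:30], 0)
x = masking(x)
x = torch.LongTensor(x).unsqueeze(1).to(device)   # (seq_len, N=1), the shape your forward pass unpacks
y = torch.LongTensor(y).to(device)

model.train()
for step in range(1000):
    out = model(x).view(-1, len(vocab))
    loss = criterion(out, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(step, loss.item())

# if the loss does not go to (almost) zero on this single sample, the problem is
# more likely in the model / tensor shapes than in the data or the schedule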

Hello, thanks for the reply.

  1. I tried to develop the model from Aladdin Persson's tutorials and the PyTorch examples (word-level language model), but neither was designed for my purpose, so I tried to combine the two.

  2. I tried overfitting on 2 sentences for 5000 epochs, but the result is the same.

  3. I already use random masking:

def masking(src):
    # replace one randomly chosen position with the mask/unk token id (1)
    mask_idx = random.randint(1, 10)
    src[mask_idx] = 1
    return src

Could it be that my model just isn't suited to this purpose?