Repeated tokens issue and tgt masking not working with nn.Transformer

Recently I have been running into an issue with the nn.Transformer module: it keeps giving me repeated tokens, even on a simple copy task. To debug the model I feed it a random array of integers from 0 to 100, where those numbers stand in for an embedding. In this case my src “sentence” has length 4 and my tgt has length 6, with SOS and EOS tokens included.
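
Roughly, the debug data looks like this (a minimal sketch of the setup described above; the batch size of 1 and the placeholder SOS/EOS values are assumptions for illustration, and nn.Transformer expects inputs shaped (seq_len, batch, d_model) by default):

import torch

d_model = 42                                          # model width, matches the 42-dim output layer below
src = torch.randint(0, 100, (4, 1, d_model)).float()  # src “sentence” of length 4, batch size 1
sos = torch.zeros(1, 1, d_model)                      # placeholder SOS “token” (assumed value)
eos = torch.full((1, 1, d_model), 101.0)              # placeholder EOS “token” (assumed value)
tgt = torch.cat([sos, src, eos], dim=0)               # tgt of length 6: the copied src wrapped in SOS/EOS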

Code

The overall model is actually intended for time series generation, so:

  • I have left out the embedding component, since the input is already a time series/number array,
  • left out the softmax layer and replaced it with a linear layer mapping 42 dimensions to 42 dimensions,
  • swapped out the cross-entropy loss for nn.MSELoss (the problem should not be here, since both loss functions essentially measure how close the prediction is to the target); a sketch of the training step is shown after this list.
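
For context, here is a minimal sketch of the training step (the optimizer, learning rate, and the tgt[:-1]/tgt[1:] teacher-forcing split are assumptions for illustration, not necessarily my exact loop):

import torch
import torch.nn as nn

model = TransformerModel()                                 # defined below
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed optimizer and learning rate
criterion = nn.MSELoss()                                   # replaces the usual cross entropy

optimizer.zero_grad()
pred = model(src, tgt[:-1])      # teacher forcing: decoder input is tgt shifted right (length 5)
loss = criterion(pred, tgt[1:])  # predict the next step at every position
loss.backward()
optimizer.step()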

Example output:

Epoch 481 | Loss for this epoch: 962.3525
pred tensor([[43.8938, 40.8303, 36.3771, 33.6664, 27.7113, 30.6695, 27.7331, 27.3123,
         35.0892, 25.3893, 43.1270, 36.4314, 36.8317, 28.0335, 35.2998, 28.6168,
         28.1540, 23.6074, 45.8259, 52.5252, 71.2013, 58.4176, 26.8109, 41.6447,
         50.6960, 50.9365, 47.0124, 66.7846, 29.6854, 54.3303, 15.8350, 47.5396,
         44.6901, 28.7130, 61.2205, 51.9038, 40.1004, 39.2980, 47.0713, 36.2085,
         25.0827, 38.7290],
        [43.9919, 40.9159, 36.6921, 33.5286, 27.9414, 30.7204, 27.5557, 27.0768,
         35.1250, 25.6034, 43.3168, 36.4885, 37.4033, 28.2621, 35.0734, 28.4577,
         28.6817, 23.8487, 46.2245, 52.4522, 70.9343, 58.4096, 26.8743, 41.8313,
         50.5821, 51.3920, 46.9513, 66.5539, 29.4575, 54.1598, 15.9336, 47.1084,
         44.7686, 28.3244, 60.9213, 51.8064, 39.8448, 39.6819, 46.7811, 36.4413,
         24.7245, 38.9211],
        [42.8098, 41.4122, 36.3937, 31.8434, 27.3688, 30.7753, 27.5833, 25.9909,
         33.9562, 26.3026, 42.0265, 35.4798, 38.7668, 29.0033, 33.8233, 25.8947,
         30.3415, 22.9186, 46.3345, 51.1159, 68.9430, 57.4045, 27.0395, 42.9046,
         48.7137, 51.0624, 45.6511, 64.7647, 28.3801, 52.1444, 14.6419, 43.5562,
         43.9671, 26.6897, 58.3230, 51.1698, 37.4116, 39.8192, 45.4452, 36.0492,
         23.0649, 39.6155],
        [43.0128, 40.3056, 35.9057, 30.8145, 27.1400, 29.8846, 27.8326, 26.3446,
         33.5310, 24.7804, 40.1322, 34.3466, 36.4286, 27.7695, 34.5397, 27.2464,
         27.1590, 22.5629, 45.3599, 50.8546, 68.8225, 56.6120, 26.2396, 41.7496,
         47.3914, 47.8912, 45.7550, 64.2823, 28.5523, 51.0438, 14.6206, 43.9090,
         43.7185, 27.7892, 58.3519, 50.0493, 38.1522, 36.9787, 45.8064, 35.0708,
         23.5255, 37.5167],
        [43.7220, 37.9346, 37.5748, 29.5633, 29.1368, 28.0909, 25.2843, 24.0507,
         34.0849, 23.8064, 40.7238, 33.5393, 37.4652, 28.0920, 32.6368, 28.3762,
         26.7279, 24.7163, 46.5599, 49.9572, 66.1192, 56.0380, 24.6792, 39.9777,
         46.1135, 49.1048, 45.0796, 61.4664, 25.6820, 49.8009, 16.2992, 42.7245,
         43.5038, 25.0353, 56.4276, 46.5286, 37.6176, 36.7476, 42.3495, 36.3939,
         20.4086, 35.5653]], grad_fn=<SliceBackward>)

These near-identical rows should not occur, since the random numbers I generate are different at every time step.

My model:

# https://pytorch.org/tutorials/beginner/transformer_tutorial.html
import math

import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.0, max_len=6):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)

class TransformerModel(nn.Module):
    def __init__(self, nhead=7, nlayers=7, d_model=42):
        super(TransformerModel, self).__init__()
        self.d_model=d_model
        self.pos_encoder = PositionalEncoding(self.d_model)
        self.pos_decoder = PositionalEncoding(self.d_model)

        self.transformer = nn.Transformer(d_model=self.d_model, nhead=nhead, num_encoder_layers=nlayers,
                                          num_decoder_layers=nlayers)
        self.fc_out = nn.Linear(d_model, d_model)
        
        self.src_mask = None
        self.memory_mask = None

    def forward(self, src, trg):
        src = self.pos_encoder(src)*math.sqrt(self.d_model)
        trg = self.pos_decoder(trg)*math.sqrt(self.d_model)
        output = self.transformer(src, trg, tgt_mask=nn.Transformer().generate_square_subsequent_mask(5))
        output = self.fc_out(output)
        return output

I suspect there is something funny going on with the tgt masking mechanism, since the output tokens are essentially repeated.
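
For anyone who wants to check the masking: this is just an inspection snippet showing what the mask passed in the forward above evaluates to (lower triangle 0., everything above the diagonal -inf, so each position can only attend to itself and earlier positions):

import torch.nn as nn

mask = nn.Transformer().generate_square_subsequent_mask(5)
print(mask)
# Roughly:
# tensor([[0., -inf, -inf, -inf, -inf],
#         [0.,   0., -inf, -inf, -inf],
#         [0.,   0.,   0., -inf, -inf],
#         [0.,   0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.,   0.]])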


Hi, I ran into a similar problem with repeated tokens. Have you solved it? I am wondering whether it is caused by using the masking incorrectly, but I am not sure what exactly is happening.