Masking in PyTorch Transformer

Hello everyone,

I’ve been looking for a guide on how to correctly use the PyTorch transformer modules with their masking etc. I have to admit I am still a little bit lost and would love some guidance.

I am trying to write a GPT-like model that will be trained in an unsupervised manner on variable-length sequences to predict the next token in the sequence. Therefore, I have prepared my data so that I can generate a minibatch of:

  • input: a batch of sequences of feature vectors, zero-padded to have the same length for each sequence
  • target: a batch of sequences as above, but the sequence is shifted one sample to the right (if input sequence is [1,2,3,4], target is [2,3,4,5])
  • padding mask: a batch of padding masks with False (0) where there is a real value and True (1) where there is padding (the `*_key_padding_mask` arguments expect True at positions that should be ignored)
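For concreteness, here is a minimal sketch of how such a minibatch can be built (`make_batch` is just an illustrative helper, not part of my actual pipeline); it pads with zeros and marks padded positions with True, matching the key-padding-mask convention:

```python
import torch

def make_batch(sequences):
    """Pad variable-length 1-D sequences and build input/target/padding mask.
    Sketch only: assumes each sequence has at least 2 tokens."""
    max_len = max(len(s) for s in sequences) - 1  # inputs/targets are one token shorter
    inputs = torch.zeros(len(sequences), max_len)
    targets = torch.zeros(len(sequences), max_len)
    pad_mask = torch.ones(len(sequences), max_len, dtype=torch.bool)  # True = padding
    for i, s in enumerate(sequences):
        n = len(s) - 1
        inputs[i, :n] = s[:-1]    # e.g. [1,2,3,4] ...
        targets[i, :n] = s[1:]    # ... predicts [2,3,4,5]
        pad_mask[i, :n] = False   # real tokens
    return inputs, targets, pad_mask

inputs, targets, pad_mask = make_batch([torch.tensor([1., 2., 3., 4., 5.]),
                                        torch.tensor([1., 2., 3.])])
```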

Now, my model is a bit more complicated because my data consists of different types of inputs, but I am using a decoder-only architecture, so the model is defined as:

class TransformerModel(nn.Module):
    def __init__(self, dataset, num_features, d_model, nhead, num_layers, dim_feedforward, dropout):
        super().__init__()
        self.embedding = nn.Linear(num_features, d_model)
        self.positional_encoding = PositionalEncoding(d_model)
        self.transformer_layers = nn.TransformerDecoderLayer(
            d_model, nhead, dim_feedforward, dropout,
            batch_first=True,
            norm_first=True,  # nowadays, layer norm is applied before the attention block
        )
        self.transformer = nn.TransformerDecoder(self.transformer_layers, num_layers)

        self.mean = nn.Linear(d_model, dataset.continuous_length)
        self.var = nn.Linear(d_model, dataset.continuous_length)
        self.binary_model = nn.Linear(d_model, dataset.binary_length)
        self.onehot_model = nn.Linear(d_model, dataset.onehot_length)

    def generate_square_subsequent_mask(self, size, device=None):
        # Mask covering the top-right triangle of the matrix: -inf above the diagonal, zeros elsewhere
        return torch.triu(torch.full((size, size), float('-inf'), device=device), diagonal=1)

    def forward(self, src, padding_mask=None):
        # process through the model
        src = self.embedding(src)
        src = self.positional_encoding(src)
        mask_size = src.shape[1]
        m = self.generate_square_subsequent_mask(mask_size, device=src.device)
        x = self.transformer(
            src, src,                                                                 # target and memory are the same
            tgt_mask=m, memory_mask=m,                                                # triangular masks so that we do not attend to future tokens
            tgt_key_padding_mask=padding_mask, memory_key_padding_mask=padding_mask,  # so that attention ignores the padded parts of the sequences
        )
        # process the outputs
        c_mean = self.mean(x)
        c_var = self.var(x)
        b = torch.sigmoid(self.binary_model(x))
        oh = self.onehot_model(x)  # raw logits
        return c_mean, c_var, b, oh

The model has 4 outputs, but that is not the important part. The important part is the masks, since I am not sure I am using them right.
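To sanity-check the mask plumbing separately from my four output heads, here is a stripped-down sketch with a bare nn.TransformerDecoder and the same tgt/memory masking (all sizes here are made up just for the check):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, nhead, batch, seq = 16, 4, 2, 5

# Decoder-only setup: same tensor serves as target and memory
layer = nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward=32,
                                   batch_first=True, norm_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)

src = torch.randn(batch, seq, d_model)
causal = torch.triu(torch.full((seq, seq), float('-inf')), diagonal=1)
pad = torch.zeros(batch, seq, dtype=torch.bool)
pad[1, 3:] = True  # last two positions of the second sequence are padding

out = decoder(src, src,
              tgt_mask=causal, memory_mask=causal,
              tgt_key_padding_mask=pad, memory_key_padding_mask=pad)
# out has shape (batch, seq, d_model)
```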

In a GPT-like architecture I must not look at future tokens, so for this purpose I am using the triangular matrix which has -inf above the diagonal and zeros otherwise. My question is: do I use the same mask both as tgt_mask and memory_mask? Since there is no encoder, the input for target and memory is the same, and therefore I would suppose both need the same mask as well; is that correct? The same goes for the padding masks: those are identical, since target and memory are the same as well.
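A quick way to see what the -inf mask does: add it to a matrix of attention scores and softmax along the last dimension, and each row only distributes weight over the current and past positions:

```python
import torch

size = 4
# -inf above the diagonal, zeros elsewhere
mask = torch.triu(torch.full((size, size), float('-inf')), diagonal=1)

scores = torch.zeros(size, size)                # pretend uniform attention scores
weights = torch.softmax(scores + mask, dim=-1)  # masked positions get exactly zero weight
# Row i attends only to positions 0..i, e.g. row 0 is [1, 0, 0, 0]
```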

Thanks for any suggestions; I just want to make sure I am using the framework the right way (I do not want to implement it manually as most people do in their tutorials…).