In TransformerEncoder, is src_key_padding_mask enough for proper backprop?

I want to use a transformer encoder for classification/regression over sequences of vectors, not necessarily text.

However, the problem is that these sequences have different lengths, which means I need to pad them in order to process batches efficiently (unfortunately, nested tensors are not supported yet in TransformerEncoderLayer).

My question is whether src_key_padding_mask is enough for proper backpropagation, so that the padding doesn't influence the gradients/updates of the parameters during training.

Masking is only applied during the attention operation, so how does this ensure that the other layers of the transformer encoder, e.g. the feed-forward and layer-norm sublayers, are not influenced by the padded inputs? Is masking the loss what effectively ignores them?
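(For concreteness, the mask I'm referring to follows the usual src_key_padding_mask convention, where True marks a padded position to be ignored; I build it from the true sequence lengths, roughly like this:)

import torch

# Hypothetical example: a batch of 3 sequences padded to length 5.
lengths = torch.tensor([5, 3, 2])               # true length of each sequence
max_len = int(lengths.max())

# True at padded positions, False at real tokens -- shape (batch, seq).
mask = torch.arange(max_len)[None, :] >= lengths[:, None]
# tensor([[False, False, False, False, False],
#         [False, False, False,  True,  True],
#         [False, False,  True,  True,  True]])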

Let's assume the following architecture. A sequence of vectors is first non-linearly projected, then processed by the encoder. Finally, the whole sequence is pooled into a global feature vector and processed by a task-specific head. In code:

import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, in_dim):
        super().__init__()

        # Example of custom embedding.
        self.embedding = nn.Sequential(
            nn.Linear(in_dim, ...),
            nn.LayerNorm(...),
            nn.ReLU(),
        )
        self.encoder = nn.TransformerEncoder(...)  # layers assumed to use batch_first=True
        self.head = nn.Linear(...)

    def forward(self, x, mask):
        # x: (batch, seq, in_dim), mask: (batch, seq) with True at padded positions.
        x = self.embedding(x)
        x = self.encoder(x, src_key_padding_mask=mask)

        # Map each sequence to a global feature vector by max-pooling over
        # the sequence dimension, ignoring padded positions.
        value = -torch.inf
        x = x.masked_fill(mask.unsqueeze(-1), value)
        x = torch.max(x, dim=1).values

        return self.head(x)

We need to ensure that none of the model parameters are affected by padding. Since the padded vectors are never selected by torch.max, this effectively eliminates the effect of padding on the gradients (the gradient of an element not selected by max is 0). Is my understanding correct?
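For reference, a quick toy check of the max-pooling step (hypothetical shapes, independent of the model above): the gradient at the padded positions comes out exactly zero, since those elements are never selected by the max.

import torch

torch.manual_seed(0)
x = torch.randn(2, 4, 3, requires_grad=True)        # (batch, seq, d_model)
mask = torch.tensor([[False, False, True, True],     # True = padding
                     [False, False, False, True]])

# Masked max-pooling over the sequence dimension.
pooled = x.masked_fill(mask.unsqueeze(-1), float('-inf')).max(dim=1).values
pooled.sum().backward()

print(x.grad[0, 2:])   # all zeros: padded positions never reach the loss
print(x.grad[1, 3:])   # all zeros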

Personal understanding here, might be helpful.

In the transformer, the feed-forward network is applied position-wise, i.e. to each token embedding independently.

If the input has shape (batch, seq, d_model), then the weight of the first linear layer has shape (d_model, hidden), projecting each input embedding into the hidden dimension.

This means that the padding tokens, once disconnected by the attention mask, will not affect the other tokens during the feed-forward network.
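A quick sanity check of this with toy tensors (not the actual model): applying the same linear layer with and without padded rows present gives identical outputs at the real positions, because nn.Linear only mixes the last (feature) dimension.

import torch
import torch.nn as nn

torch.manual_seed(0)
ffn = nn.Linear(8, 16)                                     # (d_model, hidden)

real = torch.randn(1, 3, 8)                                # 3 real tokens
padded = torch.cat([real, torch.zeros(1, 2, 8)], dim=1)    # + 2 padding tokens

# Outputs at the real positions are unchanged by the presence of padding.
print(torch.allclose(ffn(real), ffn(padded)[:, :3]))       # True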

As for the layer norm, I am not very sure. In my case, I never treated padding tokens differently and the results weren't bad. It seems that masking padding tokens is not necessary; the transformer can adapt to their presence by itself.

@Chenze_W Appreciate the answer!

It is true that tokens only affect other tokens during attention, since the other layers transform each token identically and independently. However, if we don't mask the contribution of the padding to the gradients, these dummy tokens will still affect the gradients and hence the parameter updates.
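A toy illustration of that concern (hypothetical shapes; the extra rows stand in for whatever the encoder emits at padded slots): pooling over all positions, padding included, yields different parameter gradients than pooling over the real positions only.

import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 4)

real = torch.randn(1, 3, 4)              # 3 real positions
pad = torch.randn(1, 2, 4)               # stand-in for encoder output at padded slots
full = torch.cat([real, pad], dim=1)     # (1, 5, 4)

# Unmasked pooling: padded positions enter the loss.
layer.zero_grad()
layer(full).mean(dim=1).sum().backward()
grad_unmasked = layer.weight.grad.clone()

# Masked pooling: only the real positions enter the loss.
layer.zero_grad()
layer(full)[:, :3].mean(dim=1).sum().backward()
grad_masked = layer.weight.grad.clone()

print(torch.allclose(grad_unmasked, grad_masked))   # False: padding changed the update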

I never treated padding tokens differently and the results weren't bad. It seems that masking padding tokens is not necessary; the transformer can adapt to their presence by itself.

Do you mask the loss during training?


No, I've never tried to mask the loss and it didn't seem to affect the performance. I should try masking the padding tokens.

I guess this can be done by detaching the padding tokens from the computational graph between the transformer layers.
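Untested, but a sketch of that idea could look like the following: run the encoder layers manually and replace the padded positions with a detached copy after each layer (class and argument names are just illustrative).

import torch
import torch.nn as nn

class DetachedPaddingEncoder(nn.Module):
    # Sketch: applies encoder layers one by one, detaching padded positions in between.
    def __init__(self, d_model=32, nhead=4, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, x, mask):               # mask: (batch, seq), True = padding
        pad = mask.unsqueeze(-1)               # (batch, seq, 1) for broadcasting
        for layer in self.layers:
            x = layer(x, src_key_padding_mask=mask)
            # Cut the padded positions out of the graph so no gradient
            # flows back through them from later layers or the loss.
            x = torch.where(pad, x.detach(), x)
        return x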

I’ll share the results when I actually manage to do this.