I want to use a transformer encoder for classification/regression over sequences of vectors, not necessarily text. However, these sequences have different lengths, which means I need to pad them in order to process batches efficiently (unfortunately, nested tensors are not yet supported in `TransformerEncoderLayer`).
My question is whether `src_key_padding_mask` is enough for proper backpropagation, so that the padding doesn't influence the gradients/updates of the parameters during training.
Masking is only applied during the attention operation, so how does this ensure that the other layers of the transformer encoder, e.g. the feedforward and layernorm sublayers, are not influenced by the padded inputs? Is it the masking of the loss that effectively ignores them?
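For intuition, here is a minimal sketch (the dimensions and layer settings are made up for the example) showing that with `src_key_padding_mask` set, perturbing the padded positions leaves the encoder outputs at the valid positions unchanged: attention ignores masked keys, and the feedforward/layernorm sublayers operate on each position independently, so padding never mixes into the valid positions; its (garbage) outputs just have to be excluded downstream.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# With src_key_padding_mask set, perturbing the padded positions should
# leave the outputs at the valid positions unchanged: attention ignores
# masked keys, and feedforward/layernorm act on each position separately.
layer = nn.TransformerEncoderLayer(d_model=8, nhead=2, batch_first=True)
layer.eval()  # disable dropout so the two forward passes are comparable

x = torch.randn(1, 5, 8)                                  # (batch, seq, dim)
mask = torch.tensor([[False, False, False, True, True]])  # True = padding

out_a = layer(x, src_key_padding_mask=mask)

x_perturbed = x.clone()
x_perturbed[:, 3:] += 100.0  # change only the padded positions
out_b = layer(x_perturbed, src_key_padding_mask=mask)

print(torch.allclose(out_a[:, :3], out_b[:, :3]))  # True: valid positions unaffected
```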
Let's assume the following architecture: a sequence of vectors is first non-linearly projected, then processed by the encoder; finally, the whole sequence is reduced to a global feature vector and processed by a task-specific head. In code:
```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, in_dim):
        super().__init__()
        # Example of custom embedding.
        self.embedding = nn.Sequential(
            nn.Linear(in_dim, ...),
            nn.LayerNorm(...),
            nn.ReLU(),
        )
        self.encoder = nn.TransformerEncoder(...)
        self.head = nn.Linear(...)

    def forward(self, x, mask):
        # x: (batch, seq_len, in_dim); mask: (batch, seq_len), True = padding.
        x = self.embedding(x)
        x = self.encoder(x, src_key_padding_mask=mask)
        # Map each sequence to a global feature vector.
        value = -torch.inf
        # Broadcast the mask over the feature dimension so padded
        # positions are ignored during selection.
        x = x.masked_fill(mask.unsqueeze(-1), value)
        x = torch.max(x, dim=1).values  # max over the sequence dimension
        return self.head(x)
```
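For completeness, the padding mask itself can be built from the per-sequence lengths; `make_padding_mask` below is a hypothetical helper, not part of the model above.

```python
import torch

# Hypothetical helper: build the (batch, seq_len) boolean mask expected by
# src_key_padding_mask, with True marking padded positions, from lengths.
def make_padding_mask(lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    return torch.arange(max_len, device=lengths.device)[None, :] >= lengths[:, None]

lengths = torch.tensor([3, 5, 2])
print(make_padding_mask(lengths, max_len=5))
# tensor([[False, False, False,  True,  True],
#         [False, False, False, False, False],
#         [False, False,  True,  True,  True]])
```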
We need to ensure that no model parameter is affected by padding. Since the padding vectors are never selected by `torch.max`, this effectively eliminates the effect of padding on the gradients (since the gradient of an element not selected by the max is 0). Is my understanding correct?
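As a sanity check of that argument, here is a small autograd sketch (shapes are arbitrary; the mask convention matches the model above, True = padding) showing that the padded positions receive exactly zero gradient through the masked max pooling:

```python
import torch

x = torch.randn(2, 4, 3, requires_grad=True)  # (batch, seq, dim)
mask = torch.tensor([[False, False, True, True],
                     [False, False, False, True]])

# Fill padded positions with -inf so the max never selects them.
pooled = x.masked_fill(mask.unsqueeze(-1), -torch.inf).max(dim=1).values
pooled.sum().backward()

print(x.grad[0, 2:])  # all zeros: padded positions get no gradient
print(x.grad[1, 3:])  # all zeros
```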