Positional encoding at padded values

Hi there, we're using a transformer architecture for knowledge tracing (following the implementation described in this paper: https://arxiv.org/pdf/2010.12042.pdf).

We left-pad sequences that are shorter than our maximum sequence length, and we were wondering whether we need a mask for the positional encoding as well: the positional encoding is added to the encoder input, so the padded positions become non-zero before they are fed into the transformer.

Doesn't this mean that these padded values now have some effect on the prediction, when they shouldn't influence the predictions at later (real) positions in the sequence?
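For concreteness, here is a minimal NumPy sketch of what we believe happens (this is not from the paper, just standard scaled dot-product attention with a key padding mask; the variable names and sizes are made up). It shows that if the attention scores toward padded key positions are masked to -inf, those positions receive exactly zero attention weight, so whatever sits there after the positional encoding cannot affect the outputs at real positions:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 5, 4
pad_len = 2                      # left padding: positions 0 and 1 are padding

# hypothetical encoder inputs after embedding + positional encoding,
# so even the padded positions are non-zero
q = rng.normal(size=(seq_len, d))
k = rng.normal(size=(seq_len, d))
v = rng.normal(size=(seq_len, d))

scores = q @ k.T / np.sqrt(d)    # scaled dot-product attention scores

# key padding mask: scores toward padded key positions set to -inf
scores[:, :pad_len] = -np.inf
attn = softmax(scores)           # padded keys get exactly zero weight

out = attn @ v

# changing the padded values (e.g. because the positional encoding made
# them non-zero) does not change the output at any position
v2 = v.copy()
v2[:pad_len] = 999.0
out2 = attn @ v2
assert np.allclose(attn[:, :pad_len], 0.0)
assert np.allclose(out, out2)
```

So our understanding is that the mask matters at the attention scores, not at the positional-encoding step itself, but we'd like to confirm whether that's the right way to think about it (the outputs at the padded query positions are still garbage, so presumably the loss also has to be masked there).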

