I have a question:
Let’s suppose we are training BERT from scratch.
When we pass the inputs to BertModel, do the three tensors of token, segment, and position ids have to use the same padding value (i.e. 0)?
If the padding value for the position/segment ids is different from the one used to pad the token sequence (e.g. 12), what issues would this cause during training?
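To make the question concrete, here is a toy PyTorch sketch of what I mean (the embedding tables and sizes are made up, not BertModel’s internals): the real positions come out identical either way, and only the padded slots differ, which is why I wonder whether the choice of padding value matters at all once the attention mask excludes those slots.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

max_len, hidden = 16, 8
tok_emb = nn.Embedding(100, hidden)   # toy vocab of 100 tokens
pos_emb = nn.Embedding(max_len, hidden)
seg_emb = nn.Embedding(2, hidden)

# one sequence of 4 real tokens padded to length 6
token_ids = torch.tensor([[5, 17, 42, 9, 0, 0]])   # token pad id = 0
pos_ids_a = torch.tensor([[0, 1, 2, 3, 0, 0]])     # positions padded with 0
pos_ids_b = torch.tensor([[0, 1, 2, 3, 12, 12]])   # positions padded with 12
seg_ids   = torch.zeros_like(token_ids)
attn_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])     # padded slots are masked out

emb_a = tok_emb(token_ids) + pos_emb(pos_ids_a) + seg_emb(seg_ids)
emb_b = tok_emb(token_ids) + pos_emb(pos_ids_b) + seg_emb(seg_ids)

# the real (unmasked) positions are identical regardless of the pad value
print(torch.allclose(emb_a[:, :4], emb_b[:, :4]))  # True
# only the padded slots differ, and those are the ones attn_mask excludes
print(torch.allclose(emb_a[:, 4:], emb_b[:, 4:]))  # False
```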
Thanks in advance