Effect of padded sequences in NLP Transformer

I am writing my own implementation of the Transformer according to the paper Attention is All You Need. I have never worked with sequence data before, and I am getting confused about the padded inputs. Do the padded values have an effect on the output once a bias is applied to them, even though they were zeroed by the Embedding layer? If so, should I apply a padding mask after each computation to re-zero the padded positions?
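
Here is a minimal PyTorch sketch of the effect I mean (the sizes and token ids are arbitrary, just for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

emb = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)
ff = nn.Linear(4, 4)  # bias=True by default

# batch of one sentence, last two tokens are padding (index 0)
tokens = torch.tensor([[5, 3, 7, 0, 0]])

x = emb(tokens)
print(x[0, -1])      # all zeros: the padding embedding is zeroed
print(ff(x)[0, -1])  # equals ff.bias: no longer zero after the Linear layer
```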

In natural language, sentences have variable lengths, but training in machine learning requires batching them into matrices, where variable lengths are not allowed. The logical decision is to pad the shorter sentences with 0's at the end, feed them into the Embedding layer with pad_idx = 0, and use the output for further processing. However, the linear layers (Feed Forward) in the Transformer contain a bias term, which turns those 0's into nonzero values. My intuition tells me that this does not affect the weights during back-propagation, since the derivative of a constant is zero, but I am wondering whether it might still have a negative impact on the output of the network.
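
This is roughly the masking I am considering; the helper function and names below are my own, not from the paper, so I may be doing more (or less) than is actually needed:

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v, key_padding_mask=None):
    # q, k, v: (batch, seq_len, d_model)
    # key_padding_mask: (batch, seq_len), True where the token is padding
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq, seq)
    if key_padding_mask is not None:
        # mask out padded keys so no query attends to them
        scores = scores.masked_fill(key_padding_mask.unsqueeze(1), float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

tokens = torch.tensor([[5, 3, 7, 0, 0]])
pad_mask = tokens.eq(0)  # (batch, seq_len)

emb = nn.Embedding(10, 8, padding_idx=0)
x = emb(tokens)
out = scaled_dot_product_attention(x, x, x, key_padding_mask=pad_mask)

# the part I am unsure about: re-zeroing the padded positions after a sub-layer
out = out.masked_fill(pad_mask.unsqueeze(-1), 0.0)
```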