How should the padding mask be expanded in the implementation of a transformer from scratch?


I’m implementing a transformer from scratch and I’m wondering how the padding mask should be generated. Should it mask the attention matrix in scaled dot-product attention (the softmax of QK^T) in only the columns corresponding to pad tokens, or in both the rows and the columns? In particular, masking whole rows seems like it would make the softmax produce NaNs. What is the correct approach here?
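For context, here is a minimal PyTorch-style sketch of what I mean (all names are my own, illustrative only). The padding mask has shape `(batch, 1, 1, key_len)` so it broadcasts over heads and over the query (row) dimension, masking only the key *columns* with `-inf` before the softmax; no row ever becomes all `-inf` as long as each sequence has at least one real token, so no NaNs appear:

```python
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (batch, heads, q_len, k_len)
    if mask is not None:
        # mask broadcasts over the query (row) dimension, so only the
        # key columns belonging to pad tokens are suppressed
        scores = scores.masked_fill(mask == 0, float("-inf"))
    attn = torch.softmax(scores, dim=-1)
    return attn @ v, attn

# padding mask: 1 for real tokens, 0 for pads (0 assumed to be the pad id)
tokens = torch.tensor([[5, 7, 9, 0, 0]])
pad_mask = (tokens != 0).unsqueeze(1).unsqueeze(2)  # (batch, 1, 1, k_len)
```

Rows corresponding to padded *query* positions still get valid (uniform-ish) attention weights, but their outputs are ignored downstream (e.g. by the loss), which is why column-only masking is what I'd expect to be sufficient.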