How should the padding mask be expanded in the implementation of a transformer from scratch?


I’m implementing a transformer from scratch and I’m wondering how the padding mask should be generated. Should it mask the attention matrix in scaled dot-product attention (the softmax of QK^T) in only the columns corresponding to pad tokens, or in both the rows and the columns? In particular, masking whole rows seems like it would make the softmax produce NaNs. What is the correct approach here?
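For context, here is a minimal PyTorch-style sketch of what I mean (all names are my own, illustrative only). The padding mask has shape `(batch, 1, 1, key_len)` so it broadcasts over heads and over the query (row) dimension, masking only the key *columns* with `-inf` before the softmax; no row ever becomes all `-inf` as long as each sequence has at least one real token, so no NaNs appear:

```python
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (batch, heads, q_len, k_len)
    if mask is not None:
        # mask broadcasts over the query (row) dimension, so only the
        # key columns belonging to pad tokens are suppressed
        scores = scores.masked_fill(mask == 0, float("-inf"))
    attn = torch.softmax(scores, dim=-1)
    return attn @ v, attn

# padding mask: 1 for real tokens, 0 for pads (0 assumed to be the pad id)
tokens = torch.tensor([[5, 7, 9, 0, 0]])
pad_mask = (tokens != 0).unsqueeze(1).unsqueeze(2)  # (batch, 1, 1, k_len)
```

Rows corresponding to padded *query* positions still get valid (uniform-ish) attention weights, but their outputs are ignored downstream (e.g. by the loss), which is why column-only masking is what I'd expect to be sufficient.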