[Transformer] Difference between src_mask and src_key_padding_mask

I am having a difficult time understanding transformers. Things are becoming clearer bit by bit, but one thing that still makes me scratch my head is:
what is the difference between src_mask and src_key_padding_mask, both of which are passed as arguments to the forward function of the encoder layer and the decoder layer?

Based on the PyTorch implementation source code (see here), src_mask is what is called attn_mask in a MultiheadAttention module, and src_key_padding_mask is equivalent to key_padding_mask in a MultiheadAttention module.

src_mask or attn_mask is a matrix that indicates which parts of the input sequence each position is allowed to attend to (relative to the sequence itself). For example, in an autoregressive model this would be a triangular matrix. Assuming the input sequence is of length N, this matrix is of size N*N.
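For instance, a causal (autoregressive) attn_mask can be built like this. A minimal sketch; note that PyTorch's convention for boolean masks is that True marks positions that are *not* allowed to be attended to:

```python
import torch

# Causal mask for a sequence of length N: position i may only attend
# to positions j <= i, so everything strictly above the diagonal is
# masked out (True = "do not attend").
N = 5
attn_mask = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)
```

(Recent PyTorch versions also provide nn.Transformer.generate_square_subsequent_mask, which builds the equivalent float mask with -inf above the diagonal.)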

src_key_padding_mask or key_padding_mask is a matrix that marks the padding positions that the layer should not attend to. It is of size batch_size*N.
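A key_padding_mask is typically derived straight from the token ids. A minimal sketch, assuming pad_id = 0 is the padding token:

```python
import torch

pad_id = 0  # assumed padding token id
# Batch of 2 sequences padded to length 4; the second has 2 real tokens.
tokens = torch.tensor([[5, 7, 2, 9],
                       [3, 8, 0, 0]])
# True marks padding keys that no query should attend to; shape (batch_size, N).
key_padding_mask = tokens == pad_id
```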

I found the exact same question on SO:

more related links:

perhaps reading the docs for MHA is the best option:


@ptrblck sorry for straight up pinging you…but do you mind explaining the difference between src_mask vs src_key_padding_mask (or more generally _mask vs _key_padding_mask)?

My understanding is that everything related to masking can be done with _mask, e.g. if the length of the sequence is T then I can always just mark the padding with a T * T matrix…is that not right?

What are use cases for one vs the other?
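For what it's worth, the practical difference shows up in the batch dimension: a single (T, T) attn_mask is shared by every sequence in the batch, so per-example padding needs either a key_padding_mask of shape (B, T) or a 3D attn_mask of shape (B * num_heads, T, T). A minimal sketch with nn.MultiheadAttention showing that the two formulations agree (all shapes and names here are just for illustration):

```python
import torch
from torch import nn

torch.manual_seed(0)
B, T, E, H = 2, 4, 8, 2  # batch, sequence length, embed dim, heads
mha = nn.MultiheadAttention(E, H, batch_first=True)
x = torch.randn(B, T, E)

# Per-example padding: the last two positions of the second sequence.
key_padding_mask = torch.tensor([[False, False, False, False],
                                 [False, False, True,  True]])
out1, _ = mha(x, x, x, key_padding_mask=key_padding_mask)

# Expressing the same thing with attn_mask alone requires a 3D mask of
# shape (B * num_heads, T, T), because a plain (T, T) mask cannot vary
# per batch element: broadcast the padded keys across all query rows,
# then repeat per attention head.
attn_mask = key_padding_mask[:, None, :].expand(B, T, T)
attn_mask = attn_mask.repeat_interleave(H, dim=0)
out2, _ = mha(x, x, x, attn_mask=attn_mask)

print(torch.allclose(out1, out2, atol=1e-6))  # both maskings give the same output
```

So the two arguments are interchangeable in principle, but key_padding_mask is the convenient (B, T) form for padding, while attn_mask is the general per-position form (e.g. the causal triangle).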

I found this to be the most useful answer: