I am having a difficult time understanding transformers. Everything is becoming clear bit by bit, but one thing that still makes me scratch my head is:
what is the difference between src_mask and src_key_padding_mask, which are passed as arguments to the forward function of both the encoder layer and the decoder layer?
Based on the PyTorch implementation source code (see here), src_mask is what is called attn_mask in a MultiheadAttention module, and src_key_padding_mask is equivalent to key_padding_mask in a MultiheadAttention module.
src_mask or attn_mask is a matrix used to represent which parts of the input sequence are allowed to be attended to (relative to the sequence itself). For example, in an autoregressive model this would be a triangular matrix. Assuming the input sequence is of length N, this matrix would be of size N*N.
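To make that concrete, here is a minimal sketch (with a hypothetical sequence length N = 5) of the triangular src_mask, in both the boolean form and the float form that PyTorch accepts:

```python
import torch
import torch.nn as nn

N = 5  # hypothetical sequence length

# Boolean form: True means "this position may NOT be attended to".
# torch.triu with diagonal=1 blocks everything strictly above the diagonal,
# i.e. each position can only attend to itself and earlier positions.
causal_mask_bool = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)

# Float form from the built-in helper (a static method in recent PyTorch
# versions): 0.0 where attention is allowed, -inf where it is blocked.
causal_mask_float = nn.Transformer.generate_square_subsequent_mask(N)
```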
src_key_padding_mask or key_padding_mask is a matrix that marks the padding positions that the layer should not attend to. It is of size batch_size*N.
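And here is a corresponding sketch (with made-up batch sizes and lengths) of how that batch_size*N padding mask is typically built and passed to an encoder layer, where True marks a pad position:

```python
import torch
import torch.nn as nn

batch_size, N, d_model = 3, 5, 16          # hypothetical sizes
lengths = torch.tensor([5, 3, 2])          # true lengths before padding

# src_key_padding_mask: shape (batch_size, N); True = "this is a pad token,
# no query in this sequence should attend to it".
src_key_padding_mask = torch.arange(N)[None, :] >= lengths[:, None]
# tensor([[False, False, False, False, False],
#         [False, False, False,  True,  True],
#         [False, False,  True,  True,  True]])

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
src = torch.randn(batch_size, N, d_model)
out = layer(src, src_key_padding_mask=src_key_padding_mask)
```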
@ptrblck sorry for pinging you directly…but would you mind explaining the difference between src_mask and src_key_padding_mask (or more generally _mask vs _key_padding_mask)?
My understanding is that everything w.r.t. masking can be done with _mask, e.g. if the length of the sequence is T then I can always just mark the padding with a T * T matrix…is that not right?
Still not clear to me. The comment/documentation claims they are both there to block out part of the data, which sounds like redundant parameters. Whatever src_key_padding_mask does, src_mask can do as well. On the other hand, if they are different, why does the documentation say:
at most one of src_mask and src_key_padding_mask is passed
If they are different, why can't we pass both src_mask and src_key_padding_mask to forward()?
@Zheng_Han I did some further research that might be helpful for you as well because I had the same question.
For your first point, in the documentation where it says:
at most one of src_mask and src_key_padding_mask is passed
That part of the documentation refers specifically to the 'fast path' of model execution at inference time. It is not saying that you can't pass both of them; it is saying that you can't pass both of them AND expect the 'fast path' to be taken.
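In other words, a call like the following (with hypothetical shapes) is perfectly valid; the worst that happens is that the optimized fast path is skipped and the regular implementation runs:

```python
import torch
import torch.nn as nn

batch_size, N, d_model = 2, 4, 8
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=2, batch_first=True)

src = torch.randn(batch_size, N, d_model)
causal_mask = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)
padding_mask = torch.tensor([[False, False, False, True],
                             [False, False, True,  True]])

# Both masks at once: the causal mask limits which earlier positions may be
# seen, and the padding mask removes the pad tokens on top of that.
out = layer(src, src_mask=causal_mask, src_key_padding_mask=padding_mask)
```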
Furthermore, for anyone interested in the real difference between 'src_mask' and 'src_key_padding_mask':
The main difference is that 'src_key_padding_mask' applies masking at the level of entire tokens. For example, when you set a value in the mask tensor to 'True', you are essentially saying that the corresponding token is a 'pad token' and should not be attended to by any other token.
However, if you wanted to do the same thing with 'src_mask', you would have to create a larger matrix and mask out ALL the connections between the 'pad token' and every other token.
So yes, you can accomplish everything 'src_key_padding_mask' does with 'src_mask' alone. However, it would require more work and would be more complicated. src_key_padding_mask is simply a shortcut for that.
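To illustrate that last point, here is a small sketch (single sequence, made-up pad flags) of how the same padding information would have to be expanded into an attn_mask-style matrix. Every query row has to repeat the same pad columns, and with a real batch you would additionally need the 3-D (batch*num_heads, N, N) form of the mask so each example can differ, which is exactly the extra work src_key_padding_mask saves you:

```python
import torch

N = 5
pad = torch.tensor([False, False, False, True, True])   # hypothetical pad flags

# src_key_padding_mask form: one flag per token, shape (1, N) for a batch of 1.
key_padding_mask = pad.unsqueeze(0)

# Equivalent src_mask / attn_mask form: an (N, N) matrix where every query row
# masks out the same pad columns.
attn_mask = pad.unsqueeze(0).expand(N, N)
```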