I am having a difficult time understanding transformers. Everything is becoming clear bit by bit, but one thing that still makes me scratch my head is:
what is the difference between src_mask and src_key_padding_mask, which are passed as arguments to the forward function of both the encoder layer and the decoder layer?
Based on the PyTorch implementation source code (see here), src_mask is what is called attn_mask in a MultiheadAttention module, and src_key_padding_mask is equivalent to key_padding_mask in a MultiheadAttention module.
src_mask or attn_mask is a matrix used to represent which parts of the input sequence are allowed to be attended to (relative to the sequence itself). For example, in an autoregressive model this would be a triangular matrix. Assuming the input sequence is of length N, this matrix would be of size N*N.
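To make that concrete, here is a minimal sketch (with a hypothetical sequence length N = 5) of the triangular src_mask, in both the boolean form and the float form that PyTorch accepts:

```python
import torch
import torch.nn as nn

N = 5  # hypothetical sequence length

# Boolean form: True means "this position may NOT be attended to".
# torch.triu with diagonal=1 blocks everything strictly above the diagonal,
# i.e. each position can only attend to itself and earlier positions.
causal_mask_bool = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)

# Float form from the built-in helper (a static method in recent PyTorch
# versions): 0.0 where attention is allowed, -inf where it is blocked.
causal_mask_float = nn.Transformer.generate_square_subsequent_mask(N)
```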
src_key_padding_mask or key_padding_mask is a matrix that marks the padding positions that the layer should not attend to. It is of size batch_size*N.
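And here is a corresponding sketch (with made-up batch sizes and lengths) of how that batch_size*N padding mask is typically built and passed to an encoder layer, where True marks a pad position:

```python
import torch
import torch.nn as nn

batch_size, N, d_model = 3, 5, 16          # hypothetical sizes
lengths = torch.tensor([5, 3, 2])          # true lengths before padding

# src_key_padding_mask: shape (batch_size, N); True = "this is a pad token,
# no query in this sequence should attend to it".
src_key_padding_mask = torch.arange(N)[None, :] >= lengths[:, None]
# tensor([[False, False, False, False, False],
#         [False, False, False,  True,  True],
#         [False, False,  True,  True,  True]])

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
src = torch.randn(batch_size, N, d_model)
out = layer(src, src_key_padding_mask=src_key_padding_mask)
```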
@ptrblck sorry for pinging you directly…but would you mind explaining the difference between src_mask and src_key_padding_mask (or more generally _mask vs _key_padding_mask)?
My understanding is that everything w.r.t. masking can be done with _mask, e.g. if the length of the sequence is T then I can always just mark the padding with a T * T matrix…is that not right?
Still not clear to me. The comment/documentation claims they are both there to block out part of the data, which sounds like redundant parameters. Whatever src_key_padding_mask does, src_mask can do as well. On the other hand, if they are different, why does the documentation say:
at most one of src_mask and src_key_padding_mask is passed
If they are different, why can't we pass both src_mask and src_key_padding_mask to forward()?
@Zheng_Han I did some further research that might be helpful for you as well because I had the same question.
For your first point, in the documentation where it says:
at most one of src_mask and src_key_padding_mask is passed
That part of the documentation refers specifically to the 'fast path' of model execution at inference time. It is not saying that you can't pass both of them; it is saying that you can't pass both of them AND expect the 'fast path' to be taken.
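In other words, a call like the following (with hypothetical shapes) is perfectly valid; the worst that happens is that the optimized fast path is skipped and the regular implementation runs:

```python
import torch
import torch.nn as nn

batch_size, N, d_model = 2, 4, 8
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=2, batch_first=True)

src = torch.randn(batch_size, N, d_model)
causal_mask = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)
padding_mask = torch.tensor([[False, False, False, True],
                             [False, False, True,  True]])

# Both masks at once: the causal mask limits which earlier positions may be
# seen, and the padding mask removes the pad tokens on top of that.
out = layer(src, src_mask=causal_mask, src_key_padding_mask=padding_mask)
```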
Furthermore, for anyone interested in the real difference between 'src_mask' and 'src_key_padding_mask':
The main difference is that 'src_key_padding_mask' applies masking at the level of entire tokens. For example, when you set a value in the mask tensor to 'True', you are essentially saying that the corresponding token is a 'pad token' and should not be attended to by any other token.
However, if you wanted to do the same thing with 'src_mask', you would have to create a larger matrix and mask out ALL the connections between the 'pad token' and every other token.
So yes, you can accomplish everything 'src_key_padding_mask' does with 'src_mask' alone. However, it would require more work and would be more complicated. src_key_padding_mask is simply a shortcut for that.
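To illustrate that last point, here is a small sketch (single sequence, made-up pad flags) of how the same padding information would have to be expanded into an attn_mask-style matrix. Every query row has to repeat the same pad columns, and with a real batch you would additionally need the 3-D (batch*num_heads, N, N) form of the mask so each example can differ, which is exactly the extra work src_key_padding_mask saves you:

```python
import torch

N = 5
pad = torch.tensor([False, False, False, True, True])   # hypothetical pad flags

# src_key_padding_mask form: one flag per token, shape (1, N) for a batch of 1.
key_padding_mask = pad.unsqueeze(0)

# Equivalent src_mask / attn_mask form: an (N, N) matrix where every query row
# masks out the same pad columns.
attn_mask = pad.unsqueeze(0).expand(N, N)
```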