[Transformer] Difference between src_mask and src_key_padding_mask

Ty4Reading · April 23, 2024, 6:43pm

@Zheng_Han I did some further research that might be helpful for you as well because I had the same question.

For your first point, in the documentation where it says:

at most one of src_mask and src_key_padding_mask is passed

That part of the documentation is referring specifically to the ‘fast path’ of the model execution at inference time. It is not saying that you can’t pass both of them; it is saying that you can’t pass both of them AND expect the ‘fast path’ to be used for the model.

Also, for anyone interested in the real difference between ‘src_mask’ and ‘src_key_padding_mask’:

The main difference is that ‘src_key_padding_mask’ masks entire tokens. It has one entry per token in each sequence of the batch, so when you set a value in the mask Tensor to ‘True’, you are essentially saying that the token is a ‘pad token’ and should not be attended to by any other token.
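
To make that concrete, here is a minimal plain-Python sketch of how such a mask relates to a padded batch. The names `PAD_ID` and `key_padding_mask` and the token ids are made up for illustration; this is not PyTorch code, just the idea of "one True/False per token".

```python
PAD_ID = 0  # hypothetical padding token id, chosen only for this example


def key_padding_mask(batch, pad_id=PAD_ID):
    """Return one row per sequence, with True wherever the token is padding."""
    return [[tok == pad_id for tok in seq] for seq in batch]


# Two sequences padded to the same length (5) with PAD_ID.
batch = [
    [5, 7, 9, 0, 0],
    [3, 4, 0, 0, 0],
]

mask = key_padding_mask(batch)
for row in mask:
    print(row)
```

With real tensors the same mask is usually just an elementwise comparison of the token ids against the pad id.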

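However, if you wanted to do the same thing with ‘src_mask’, you would have to create a larger matrix with one entry per (query token, key token) pair, and you would have to mask out ALL the entries where the ‘pad token’ is the key, i.e. the whole column for that token. Since the padding usually differs from one sequence to the next, you would also need a separate matrix for each sequence in the batch.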
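
As a rough illustration of the extra work, this plain-Python sketch expands one sequence's padding mask into the equivalent square mask. The helper name `expand_to_full_mask` is hypothetical, and this only shows the idea, not how PyTorch combines the masks internally.

```python
def expand_to_full_mask(padding_row):
    """Expand one sequence's padding mask (length S) into an S x S mask.

    full[q][k] is True when query position q must not attend to key
    position k. Every row is a copy of padding_row, so each padded key
    column is blocked for all queries.
    """
    return [list(padding_row) for _ in padding_row]


padding_row = [False, False, True]  # last token is a pad token
full = expand_to_full_mask(padding_row)
for row in full:
    print(row)

# S entries describe the padding; the square mask needs S * S entries.
num_entries = sum(len(row) for row in full)
```

So for a sequence of length S you go from S values to S * S values, repeated for every sequence in the batch.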

So yes, you can accomplish what ‘src_key_padding_mask’ does with ‘src_mask’ alone. However, it would require more work and would be more complicated. The ‘src_key_padding_mask’ is a simple shortcut for that.