What is the difference between src_mask and src_key_padding_mask?

Here is a simple example:

> import torch
> import torch.nn as nn
> 
> q = torch.randn(5, 1, 10)  # source sequence length 5, batch size 1, embedding size 10
> 
> def src_mask(sz):
>     # additive mask: 0.0 where attention is allowed, -inf where it is blocked
>     mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)  # lower triangular
>     mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
>     return mask
> 
> attn = nn.MultiheadAttention(10, 1)  # embed_dim 10, 1 head
> attn_w1 = attn(q, q, q, attn_mask=src_mask(5))[1]
> 
> src_key_padding_mask = torch.tensor([[0, 0, 0, 1, 1]]).bool()  # True marks padded positions
> attn_w2 = attn(q, q, q, key_padding_mask=src_key_padding_mask)[1]

These two mask arguments are confusing.

I understand that attn_mask expects shape (seq_len, seq_len) (or (batch * num_heads, seq_len, seq_len)), while key_padding_mask expects (batch, seq_len). But can we use them interchangeably? For the padding mask, we could also design a mask that fits the attention-map shape (seq_len, seq_len).
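For example, here is what I mean by designing a padding-style attn_mask (a rough sketch, assuming batch size 1 and reusing q, attn, and src_key_padding_mask from my example above; padding_as_attn_mask is just my own name). If the two arguments really are interchangeable, I would expect the two attention maps to match:

> # sketch: express the (batch, seq_len) padding mask as an additive
> # (seq_len, seq_len) attn_mask, assuming batch size 1 as in the example above
> padding_as_attn_mask = torch.zeros(5, 5).masked_fill(
>     src_key_padding_mask[0].unsqueeze(0), float('-inf'))  # -inf in the padded key columns
> 
> w_pad = attn(q, q, q, key_padding_mask=src_key_padding_mask)[1]
> w_attn = attn(q, q, q, attn_mask=padding_as_attn_mask)[1]
> print(torch.allclose(w_pad, w_attn))  # I would expect True if they are interchangeable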

Also, does src_key_padding_mask act only on the key matrix? In the self-attention case, should we also mask the other two, query and value?
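
For instance, this is the check I have in mind (reusing q, attn, and src_key_padding_mask from above). My expectation is that the padded key positions show up as zeroed columns in the attention map, while every query row, including the padded ones, still attends to the valid keys:

> w = attn(q, q, q, key_padding_mask=src_key_padding_mask)[1]  # (batch, tgt_len, src_len) = (1, 5, 5)
> print(w[0])
> # columns 3 and 4 (the padded keys) should be all zeros, while every row,
> # including rows 3 and 4, should still sum to 1 over the valid keys 0-2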