How to get memory_mask for nn.TransformerDecoder

The generate_square_subsequent_mask function in nn.Transformer can only generate square masks, but memory_mask requires shape (T, S). Is there a built-in function in the Transformer for this? Thank you!


If you take a look at the source code of the generate_square_subsequent_mask function, you will see how it works. It’s pretty simple.

Hi, thank you for your reply! I understand that generate_square_subsequent_mask can only generate square masks, but according to the nn.Transformer documentation, memory_mask requires shape (T, S). So do I need a custom function to generate it? Thank you!

I don’t think so. You don’t need to use memory_mask unless you want to prevent the decoder from attending to some tokens in the input sequence, and the original Transformer didn’t use it in the first place, because the decoder should be aware of the entire input sequence for any token in the output sequence. The same applies to the input sequence (i.e., src_mask).

In PyTorch terms, the original Transformer settings are src_mask=None, memory_mask=None, and tgt_mask=generate_square_subsequent_mask(T).

Again, memory_mask is used only when you don’t want to let the decoder attend to certain tokens in the input sequence. That is why its shape is (T, S) (where T is the output sequence length and S is the input sequence length.)
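In code, a minimal sketch of those settings looks like this (the sizes and hyperparameters below are made up for illustration) -- only the decoder’s self-attention gets a mask:

```python
import torch
import torch.nn as nn

S, T, N, E = 10, 7, 2, 32          # source len, target len, batch, embed dim
model = nn.Transformer(d_model=E, nhead=4)

src = torch.rand(S, N, E)          # encoder input, shape (S, N, E)
tgt = torch.rand(T, N, E)          # decoder input, shape (T, N, E)

# Causal (T, T) mask for decoder self-attention; everything else stays None.
tgt_mask = model.generate_square_subsequent_mask(T)

out = model(src, tgt, src_mask=None, tgt_mask=tgt_mask, memory_mask=None)
print(out.shape)                   # torch.Size([7, 2, 32])
```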

If you still want to create such a mask, say, so that the decoder does not attend the future positions in the encoder, you can build it with torch.cat(). Note that the result must have shape (T, S). For example,

# if T > S (here T = 7, S = 5): the extra target positions may
# attend the entire source, so the appended rows are zeros

>>> torch.cat([model.generate_square_subsequent_mask(S),
               torch.zeros(T - S, S)], dim=0)

tensor([[0., -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])

# if S > T (here S = 7, T = 5): the source positions beyond T are
# "future" for every target position, so the appended columns are -inf
>>> torch.cat([model.generate_square_subsequent_mask(T),
               torch.full((T, S - T), float('-inf'))], dim=1)

tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf]])

Hope it helps.


Thank you for your reply! To my understanding, there are two masks in MultiheadAttention: one is attn_mask and the other is key_padding_mask. In the decoder, the first multi-head self-attention takes tgt_mask as attn_mask, which prevents the decoder from seeing its subsequent tokens, and key_padding_mask is the padding mask for the target sequence. The second multi-head attention is the contextual attention, which leverages the output from the encoder and the decoder self-attention. I don’t quite understand attn_mask and key_padding_mask in the contextual attention; could you explain those two masks? Thank you!

The second multi-head attention is the contextual attention, which leverages the output from the encoder and the decoder self-attention. I don’t quite understand attn_mask and key_padding_mask in the contextual attention; could you explain those two masks? Thank you!

If “the contextual attention” means the encoder-decoder attention of the Transformer, they (attn_mask and key_padding_mask) are memory_mask and memory_key_padding_mask.
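As a small illustration (the token ids, PAD value, and tensor sizes below are made up for the example), memory_key_padding_mask is just a boolean (N, S) tensor that is True at padded source positions, so the decoder never attends to padding:

```python
import torch

PAD = 0  # assumed padding token id
src_tokens = torch.tensor([[5, 8, 2, PAD, PAD],
                           [7, 3, 9, 4, 1]])      # (N, S) padded batch

# True where the source is padding: these keys are ignored by
# the encoder-decoder attention.
memory_key_padding_mask = src_tokens.eq(PAD)      # (N, S), dtype=torch.bool
print(memory_key_padding_mask)
# tensor([[False, False, False,  True,  True],
#         [False, False, False, False, False]])
```

The same tensor is typically also passed as src_key_padding_mask to the encoder, since the padded positions are the same on both sides.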

Thanks for your reply! Yes, I understand in the encoder-decoder attention, attn_mask is memory_mask and key_pad_mask is memory_key_padding_mask, but I didnt quite get how these two mask looks like and how they work. Also, I am confused are theses two masks mandatory when we use nn.TransformerDecoder? (In the pytorch documentation, it shows optional). I am also wondering when inference process, are these masks needed?

They are optional as their default values are None.

It might be helpful for you to understand that there are three types of attention in the Transformer:

    1. encoder self-attention (no mask needed)
    2. encoder-decoder attention (no mask needed)
    3. decoder self-attention (mask needed)

In some sense, attention is a way to calculate a vector representation of a token using its context. (2) encoder-decoder attention calculates a vector representation of each token in the (generated) output sequence based on all tokens in the input sequence. Thus, it’s called encoder-decoder attention.

Unless you want to exclude certain token(s) in the input sequence for some output token(s), there is no reason to mask the information from the input sequence. Therefore, the default mask is None.
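To make the mechanics concrete, here is a tiny standalone illustration (not from the thread) of how an additive attention mask works: -inf entries are added to the attention scores before the softmax, which drives the corresponding weights to zero, while 0 entries leave the scores unchanged:

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([1.0, 2.0, 3.0])            # raw attention scores
mask = torch.tensor([0.0, float('-inf'), 0.0])    # block the middle key

# Additive mask: exp(-inf) = 0, so the masked position gets zero weight
# and the remaining weights renormalize among themselves.
weights = F.softmax(scores + mask, dim=-1)
print(weights)  # middle weight is exactly 0
```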

The following article visually explains how encoder/decoder self-attention and encoder-decoder attention work in the Transformer.


Thank you! Your explanation is very clear and very helpful! I still have a concern: I followed your suggestion and trained a summarization model using nn.TransformerDecoder (I used BERT as the encoder). The training seems to go well (it can generate meaningful sentences), but the inference process only generates the same token repeatedly. I am wondering whether training and inference differ in their settings? (The only difference between my training and inference is that at inference I send one token at the very beginning and append each generated token to the decoder input for the next prediction.)

Did you find a solution ?

If you see a repeating token from the Transformer during inference, you may need a mask during training. Without it, your model is not effectively learning anything.
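For anyone comparing training and inference, here is a minimal greedy-decoding sketch with an untrained nn.Transformer (the BOS id, vocabulary size, and embedding/projection layers are assumptions for illustration, so the generated ids are meaningless): the point is that the causal tgt_mask is regenerated for the growing target at every step, mirroring the mask used during training.

```python
import torch
import torch.nn as nn

VOCAB, E, BOS = 100, 32, 1             # assumed vocab size and BOS token id
embed = nn.Embedding(VOCAB, E)
model = nn.Transformer(d_model=E, nhead=4)
proj = nn.Linear(E, VOCAB)             # assumed output projection to vocab

src = torch.rand(6, 1, E)              # (S, N, E) encoder input
ys = torch.tensor([[BOS]])             # (T, N) generated ids, grows each step

with torch.no_grad():
    memory = model.encoder(src)        # encode the source once
    for _ in range(5):
        tgt = embed(ys)                                            # (T, 1, E)
        # Causal mask rebuilt for the current target length.
        tgt_mask = model.generate_square_subsequent_mask(ys.size(0))
        out = model.decoder(tgt, memory, tgt_mask=tgt_mask)        # (T, 1, E)
        next_tok = proj(out[-1]).argmax(dim=-1)                    # (1,)
        ys = torch.cat([ys, next_tok.unsqueeze(0)], dim=0)

print(ys.shape)   # torch.Size([6, 1])
```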

What could be the possible reasons why my model is delivering repeated tokens during training and at inference?

Same problem - would love any input or ideas?