nn.Transformer explanation

@zhangguanheng66 Thanks for the explanation.
Just to check whether I understand correctly: we should provide the sequence padding mask in src_key_padding_mask, and its dimension would be (N, S), where N is the batch size and S is the sequence length. I am confused about what the content of src_key_padding_mask should be. Will it be a -inf/0 matrix or a boolean matrix with True/False?

The padding mask is (N, S) with boolean True/False. src_mask is (S, S) with float('-inf') and float(0.0). There is a note in the PyTorch nn.Transformer docs.
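
For anyone skimming, here is a minimal sketch of the two mask types just described. The toy token ids and the `PAD_ID` value are assumptions made up for illustration; they are not from the docs.

```python
import torch

# Hypothetical toy batch: N=2 sequences padded to S=4; PAD_ID is an assumed padding index.
PAD_ID = 0
src_tokens = torch.tensor([[5, 6, 7, PAD_ID],
                           [8, 9, PAD_ID, PAD_ID]])  # shape (N, S)

# Boolean padding mask, shape (N, S): True marks padded positions to ignore.
src_key_padding_mask = src_tokens.eq(PAD_ID)

# Float attention mask, shape (S, S): 0.0 where attention is allowed,
# -inf where it is blocked (here: everything above the diagonal).
S = src_tokens.size(1)
src_mask = torch.triu(torch.full((S, S), float("-inf")), diagonal=1)
```

The padding mask has one row per batch item, while the (S, S) mask carries no batch dimension at all.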

Hi @zhangguanheng66, @akashs, @LiHaibo

Can you please tell me what is the difference between the two sets of masks viz. ***_mask and ***_key_padding_mask?

From the documentation in the source code, this is what I could deduce. But I am not very confident and hence would really appreciate it if you can correct me:

  • src_mask, tgt_mask and memory_mask should be used when we want to apply the same mask to all the sequences in the given batch.
  • src_key_padding_mask, tgt_key_padding_mask and memory_key_padding_mask should be used when we want to specify different masks for different samples in the given batch. Also, the way you specify the masks is slightly different from the previous one.

My question is: Do both sets of masks achieve the same purpose? And should we be using either one of them?

For instance, if you want to create a Seq2Seq Transformer model with both TransformerEncoder and TransformerDecoder, is it ok, if I only specify src_mask, tgt_mask and memory_mask?

@shahensha To your questions, key_padding_mask controls which key positions each batch item is allowed to attend to. This is most commonly used to avoid attending to padding elements. attn_mask controls which key positions each query position is allowed to attend to. This is useful for doing left-to-right (causal) attention, where we enforce that query positions are only allowed to attend to keys to their left.
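
A small sketch of that difference, using nn.MultiheadAttention directly. The sizes and mask contents are toy values chosen for illustration, and boolean masks are used for both arguments (True means "not allowed").

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, L, E = 2, 4, 8  # batch size, sequence length, embedding size (toy values)
mha = nn.MultiheadAttention(embed_dim=E, num_heads=2, batch_first=True)
x = torch.randn(N, L, E)

# attn_mask (L, L): shared by every batch item; row = query position, column = key position.
attn_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)  # True = blocked

# key_padding_mask (N, L): one row per batch item; True = this key is padding.
key_padding_mask = torch.tensor([[False, False, False, True],
                                 [False, False, True, True]])

# weights has shape (N, L, L): attention weights averaged over heads.
_, weights = mha(x, x, x, attn_mask=attn_mask, key_padding_mask=key_padding_mask)
```

In `weights`, the upper triangle is zero for every batch item (from attn_mask), and the padded key columns are zero only for the batch item they belong to (from key_padding_mask).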


Thank you @zhangguanheng66

I finally understood all the different masks in the API. But for some reason, my system is not able to work well at inference time. The loss goes down nicely, but at inference it just produces garbage values.

I’m having a hard time understanding how to use nn.Transformer, too, even after reading through this thread, the tutorial, this github issue, and the example language model. My model seems to do nothing but copy the target sequence, no matter what I do.

The task is to predict the title of an article, given a sentence from the article. It’s just a test task for a similar task I would like to do. The sentence and the title are both of varying length. To facilitate batching, I use the data loader’s collate_fn to pad every sentence in a batch to the length of the longest sentence in the batch, and likewise for the titles. When using nn.Transformer, I make the sentence the src and the title the tgt.

I include a padding mask for both src and tgt, which has True values wherever I padded a sentence. I also include a tgt_mask generated by generate_square_subsequent_mask to make it so that the decoder can’t look ahead in a sequence while it’s predicting. Since the model was still copying everything, I also included a square mask for the src, but that didn’t help anything.
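
A rough sketch of the kind of collate_fn described here, not the poster's actual code. It assumes each dataset item is a (sentence_ids, title_ids) pair of 1-D LongTensors and that `PAD_ID` is a padding index chosen for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assumed padding index

def collate_fn(batch):
    """batch: list of (sentence_ids, title_ids) pairs of 1-D LongTensors."""
    sentences, titles = zip(*batch)
    # Pad each group to the longest item in this batch.
    src = pad_sequence(list(sentences), batch_first=True, padding_value=PAD_ID)  # (N, S)
    tgt = pad_sequence(list(titles), batch_first=True, padding_value=PAD_ID)     # (N, T)
    # Boolean padding masks: True wherever a position was padded.
    return src, tgt, src.eq(PAD_ID), tgt.eq(PAD_ID)
```

It can be passed as `DataLoader(dataset, batch_size=..., collate_fn=collate_fn)`; the two boolean tensors are what src_key_padding_mask and tgt_key_padding_mask expect.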

I feel that I’m missing something very obvious. Can anybody help?

Looping in @zhangguanheng66 who seems to know a lot about this.

For your first part, it seems that you are not setting up attn_mask correctly.

Wow, thanks for the quick reply.

Which attn_mask is that? Both source and target masks should be pretty standard

Here’s how I’m using it, where self.base is just a model that returns embeddings for inp (src) and tgt, and where src_mask and tgt_mask are the standard upper triangle matrices, and src/tgt_key_padding_mask are as I described previously:

inp_emb, tgt_emb = self.base(inputs, targets)
# We get inputs and targets in (N, S, E) and (N, T, E), and nn.Transformer requires (S, N, E) and (T, N, E), so we transpose them
inp_emb = inp_emb.transpose(0, 1)
tgt_emb = tgt_emb.transpose(0, 1)

hdn = self.transformer(inp_emb, tgt_emb, src_mask=src_mask, tgt_mask=tgt_mask, src_key_padding_mask=inp_padding_mask, tgt_key_padding_mask=tgt_padding_mask)

out = self.head(hdn)
out = out.transpose(0, 1)

loss_fct = nn.CrossEntropyLoss()
out_view = out.contiguous().view(-1, self.vocab_size)
tgt_view = targets.view(-1)
loss = loss_fct(out_view, tgt_view)

Could the transposes be throwing it off?

Well I was right, I was indeed missing something very obvious. To anyone who comes after me and has a similar problem, the reason why my network was only copying results was because my training strategy was wrong. I was passing in targets to the decoder and calculating loss based on how similar what it produced was to those targets. If you think about it, I was asking the decoder to behave like an auto-encoder, to reproduce exactly what I passed in. That’s not very difficult for a transformer decoder to do, so it learned to copy very quickly, even with masks. Doing this also makes it impossible to perform inference, since the decoder never learned how to generate anything new.

How, you might ask, do you fix this? The solution for me was a couple steps:

  1. To add special start and end tokens to every target; e.g. [ 'h', 'e', 'l', 'l', 'o'] became [ <start>, 'h', 'e', 'l', 'l', 'o', <end>] (since it’s a character model, my start and end tokens are actually unicode tokens)
  2. To add an additional loop in the training loop that starts with a target of length 1 and passes incrementally larger targets until it passes the entire target. Then calculate loss based on how similar the output is to the target shifted left by one. (I also do backpropagation each time; not sure if that’s correct or if they should be aggregated over the whole sub-loop.) E.g. [<start>] goes in, ['h'] is expected. Then [<start>, 'h'] goes in, ['h', 'e'] is expected. And so on. The last iteration is [<start>, 'h', 'e', 'l', 'l', 'o'], with ['h', 'e', 'l', 'l', 'o', <end>] expected. This particular way of training is called teacher forcing. It also sets us up nicely to perform inference.
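
A sketch of the shifted-target idea in step 2, written as a single forward pass rather than the incremental loop described above. With a causal tgt_mask, output position t only sees decoder inputs up to t, so one pass covers every prefix at once. This is a commonly used alternative, not the poster's code; the toy sizes, random token ids, and `PAD_ID` are assumptions, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, E, PAD_ID = 20, 16, 0  # toy sizes; PAD_ID is an assumed padding index

embed = nn.Embedding(VOCAB, E, padding_idx=PAD_ID)
transformer = nn.Transformer(d_model=E, nhead=2, num_encoder_layers=1,
                             num_decoder_layers=1, dim_feedforward=32, dropout=0.0)
head = nn.Linear(E, VOCAB)

src = torch.randint(1, VOCAB, (3, 7))  # (N, S) random stand-in for sentences
tgt = torch.randint(1, VOCAB, (3, 6))  # (N, T+1) stand-in for <start> ... <end>

# Decoder input drops the last token; expected output drops the first (shift left by one).
tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]
T = tgt_in.size(1)
tgt_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

# nn.Transformer defaults to (seq, batch, embed), hence the transposes.
hidden = transformer(embed(src).transpose(0, 1), embed(tgt_in).transpose(0, 1),
                     tgt_mask=tgt_mask)
logits = head(hidden).transpose(0, 1)  # (N, T, VOCAB)

loss = nn.CrossEntropyLoss(ignore_index=PAD_ID)(
    logits.reshape(-1, VOCAB), tgt_out.reshape(-1))
loss.backward()
```

The key point is that the decoder never receives the token it is asked to predict at the same position, which is what prevents the copying behaviour.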

Inference (answering this issue now) then happens by simply passing the hidden state from the encoder and the [<start>] token to the decoder. Since the model has been trained to output a single token when a single <start> token is passed in, it should output (hopefully) the correct first token of our output sequence. Then, we can take that token and append it to our <start> token, and pass in that as input to the decoder. Now it should generate two tokens. We repeat this process until the <end> token is generated, and then we stop. This is known as greedy decoding. Both teacher forcing and greedy decoding are used with Google’s T5, so they’re viable today. There is, however, a method called beam search that gets better results, but takes much longer to generate.
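
The decoding loop described above might look roughly like this. It is a sketch with an untrained toy model, so the generated ids are meaningless; `START_ID`, `END_ID`, `MAX_LEN`, and the helper name `greedy_decode` are all hypothetical, and positional encodings are again left out.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, E = 20, 16
START_ID, END_ID = 1, 2  # hypothetical special-token ids
MAX_LEN = 10             # assumed cap so the loop always terminates

embed = nn.Embedding(VOCAB, E)
transformer = nn.Transformer(d_model=E, nhead=2, num_encoder_layers=1,
                             num_decoder_layers=1, dim_feedforward=32, dropout=0.0)
head = nn.Linear(E, VOCAB)
transformer.eval()

@torch.no_grad()
def greedy_decode(src):
    """src: (S,) LongTensor for a single example; returns the generated token ids."""
    # Encode the source once and reuse the result at every decoding step.
    memory = transformer.encoder(embed(src).unsqueeze(1))  # (S, 1, E)
    ys = torch.tensor([START_ID])
    for _ in range(MAX_LEN):
        T = ys.size(0)
        tgt_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        out = transformer.decoder(embed(ys).unsqueeze(1), memory, tgt_mask=tgt_mask)  # (T, 1, E)
        # Only the last position is new; take its most likely token.
        next_id = head(out[-1, 0]).argmax().item()
        ys = torch.cat([ys, torch.tensor([next_id])])
        if next_id == END_ID:
            break
    return ys.tolist()

tokens = greedy_decode(torch.randint(3, VOCAB, (7,)))
```

Replacing the argmax with a search over several candidate prefixes is where beam search would come in.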


If you switch to transformer encoder and have the triangle src_mask, you should be able to predict the next word, just like this example

With just an encoder, wouldn’t the size of the output be limited to the size of src? That is, if I have a sentence the cow jumped over the moon (28 characters), then the maximum length of the predicted title is 28 characters. But with the encoder-decoder, the sentence can be any length and the output can be any length, which I want.

Thanks for your helpful comments here! I am very grateful to see someone who knows what to do. I would like to apply left to right causal attention so that I get a representation for each timepoint in my time series that I can use to make predictions. Do you know of any successful examples of applying this left to right causal attention?

For a transformer encoder, the output sequence has the same size as the input (a.k.a. src).
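
A quick way to see this, and the left-to-right (causal) case asked about above, is a toy encoder with a triangular mask. The sizes are arbitrary values for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
S, N, E = 5, 2, 16  # sequence length, batch size, embedding size (toy values)
layer = nn.TransformerEncoderLayer(d_model=E, nhead=2, dim_feedforward=32, dropout=0.0)
encoder = nn.TransformerEncoder(layer, num_layers=1)

src = torch.randn(S, N, E)
# Triangular mask: position i may only attend to positions <= i.
causal = torch.triu(torch.full((S, S), float("-inf")), diagonal=1)

out = encoder(src, mask=causal)
print(out.shape)  # torch.Size([5, 2, 16]), same as src
```

Each of the S output vectors is a representation of the sequence up to that time step, which is what a per-timepoint prediction head would consume.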

Do you think the word language modeling task supports this?

It means the attention mask is still not set up properly. Could you send a code snippet?

I’m still confused about tgt_key_padding_mask. As described in the docs, its dimension is (N, T). In translation tasks, the decoder input needs both future words and padding masked. However, it seems that tgt_key_padding_mask can only mask padding. Does nn.TransformerDecoder mask future words somewhere in the source code (I haven’t found that)? Maybe tgt_mask (T, T) can mask future words, but it doesn’t work when the input is a batch.

To mask future tokens, you should use src_mask, tgt_mask, memory_mask
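
As a sketch of how the two decoder-side masks can be combined on a batch: a single (T, T) tgt_mask is applied to every batch item, while tgt_key_padding_mask (N, T) differs per item. The sizes and padding pattern are toy assumptions, and boolean masks are used for both (True means "not allowed").

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, S, T, E = 2, 6, 4, 16  # toy sizes
model = nn.Transformer(d_model=E, nhead=2, num_encoder_layers=1,
                       num_decoder_layers=1, dim_feedforward=32, dropout=0.0)

src = torch.randn(S, N, E)  # stand-in for embedded source, (S, N, E)
tgt = torch.randn(T, N, E)  # stand-in for embedded target, (T, N, E)

# One (T, T) causal mask, shared across the whole batch.
tgt_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# Per-item padding mask (N, T): the second item has two padded positions.
tgt_key_padding_mask = torch.tensor([[False, False, False, False],
                                     [False, False, True, True]])

out = model(src, tgt, tgt_mask=tgt_mask, tgt_key_padding_mask=tgt_key_padding_mask)
# out has shape (T, N, E)
```

Changing the last target position of one batch item leaves the earlier output positions unchanged, which is one way to check that the causal mask is being applied to the batch.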

Hi, I do not understand why both src and tgt are required for nn.Transformer.

Let’s say for a machine translation use case, I understand that during training, src and tgt are 2 different languages. But during testing, where we are given an input and must predict an output, we do not have tgt. If so, what should the tgt input be? The start-of-sentence token (e.g. <sos>)?

Yes. The tgt input will be, as you rightly said, <sos>.

Thanks, but src_mask only works when the input is a single sequence, not a batch.