nn.Transformer explanation

Can someone explain the shapes of src and src_mask for nn.Transformer?
For example, I have a tokenized text sentence with max_len=128.
This sentence goes through an nn.Embedding with src_vocab=5000 and emb_dim=128 (i.e. num_embeddings=5000, embedding_dim=128).
The output of the embedding is a tensor of shape (N, 128, 128), where N is the batch size.
The transformer docs say that src and src_mask have these shapes:
src: (S, N, E) and src_mask: (S, S)
where S is the source sequence length, N is the batch size, and E is the number of features.
Do I need to change the embedding output before using it as input to the transformer layer?
I’m a bit confused.

S is the number of elements; N is the number of batches; E is the number of features (a.k.a. embedding dimension in your case).

If you send token indices of shape (S, N), with values in [0, 5000), to the embedding layer, the output will have shape (S, N, 128). Then you don’t need to make any changes before feeding it to the transformer layer. The src_mask is just a square (S, S) matrix that is used to filter the attention weights.

See example here
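
To make the shapes concrete, here is a minimal sketch, assuming the sizes from the question (vocabulary of 5000, embedding dimension 128, sequence length 128). The batch size of 4, nhead=8, and the single encoder layer are arbitrary choices for illustration. The embedding layer receives integer token indices, and the (S, S) mask is built by hand with torch.triu.

```python
import torch
import torch.nn as nn

S, N, E, VOCAB = 128, 4, 128, 5000  # sizes from the question; N is arbitrary

embedding = nn.Embedding(num_embeddings=VOCAB, embedding_dim=E)
encoder_layer = nn.TransformerEncoderLayer(d_model=E, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)

tokens = torch.randint(0, VOCAB, (S, N))  # integer indices, shape (S, N)
src = embedding(tokens)                   # shape (S, N, E)

# Additive (S, S) mask: 0.0 where attention is allowed, -inf where it is blocked.
src_mask = torch.triu(torch.full((S, S), float("-inf")), diagonal=1)

out = encoder(src, mask=src_mask)         # shape (S, N, E)
print(src.shape, src_mask.shape, out.shape)
```

The mask has no batch dimension: the same (S, S) matrix is applied to every sequence in the batch.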

Thanks for your reply!!
I’m a bit confused by the embedding layer output. I’ll try to explain:

My sentences have size torch.Size([128]).
So, with a batch size of 32, the tensor will have size
torch.Size([32, 128]) -> shape = (N, S)
When I send this tensor to the embedding layer (with src_vocab=5000 and emb_dim=128), the output will have size
torch.Size([32, 128, 128]) -> shape = (N, S, E).
This is what confuses me: should I permute the first and second dimensions so that the shape becomes (S, N, E)?

Yep. You should transpose the output of the embedding layer before passing it to the transformer.

For nn.Transformer, we chose the shape (S, N, E), while some NLP people use (N, S, E). Neither is right or wrong, and switching between the two shapes is fine.
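
A minimal sketch of that transpose, assuming the batch of 32 sentences of length 128 described above. The nhead=8 setting and the random token indices are placeholders for illustration.

```python
import torch
import torch.nn as nn

N, S, E, VOCAB = 32, 128, 128, 5000

embedding = nn.Embedding(VOCAB, E)
layer = nn.TransformerEncoderLayer(d_model=E, nhead=8)

batch = torch.randint(0, VOCAB, (N, S))  # (N, S) token indices
emb = embedding(batch)                   # (N, S, E)
src = emb.transpose(0, 1)                # (S, N, E), the default layout

out = layer(src)                         # (S, N, E)
out_batch_first = out.transpose(0, 1)    # back to (N, S, E) if needed
print(emb.shape, src.shape, out_batch_first.shape)
```

Recent PyTorch releases also accept batch_first=True when constructing these layers, in which case (N, S, E) can be passed directly and no transpose is needed.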

Thank you!!! :smiley:

Hi, I’m a bit confused by src_mask and src_key_padding_mask. The explanations in the PyTorch docs are:
src_mask – the additive mask for the src sequence (optional).
src_key_padding_mask – the ByteTensor mask for src keys per batch (optional).
As I understand it, src_mask has dimension (S, S), where S is the max source length in the batch, so I would need to pass a src_mask of shape (N, S, S) to the Transformer. I don’t know if I understand that correctly. I also don’t understand the docs’ explanation of src_key_padding_mask.
In the provided example code,
output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
the [src/tgt/memory]_key_padding_mask arguments are left at their default of None, and I’m a little confused about that.

@LiHaibo
First, both masks work on the dot product of query and key in the “Scaled Dot-Product Attention” layer.
src_mask works on the attention matrix of dimension (S, S) and can add '-inf' to any single position. src_key_padding_mask is more like a padding marker: it masks specific tokens in the src sequence (i.e. the entire column of the (query, key) attention matrix for that token is set to '-inf').

@zhangguanheng66 Thanks for the explanation.
Just to check whether I understand correctly: we should provide the sequence padding mask in src_key_padding_mask, and its dimension would be (N, S), where N is the batch size and S is the sequence length. I’m not sure what the content of src_key_padding_mask should be: a -inf/0 matrix, or a boolean matrix with True/False?

The padding mask is (N, S) with boolean True/False, where True marks a padded position. src_mask is (S, S) with float('-inf') and float(0.0). There is a note in the PyTorch nn.Transformer docs.
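
As a small sketch of both masks for one padded batch: PAD_IDX = 0 and the token values below are made up for illustration. Note that the padding mask stays in (N, S) layout even when src itself is passed as (S, N, E).

```python
import torch

PAD_IDX = 0  # hypothetical padding index

# Two sequences padded to length S = 5; shape (N, S).
tokens = torch.tensor([[5, 7, 9, 0, 0],
                       [3, 4, 6, 8, 2]])
N, S = tokens.shape

# Boolean (N, S) mask: True marks a padded position that should be ignored.
src_key_padding_mask = tokens.eq(PAD_IDX)

# Float (S, S) mask: 0.0 where attention is allowed, -inf where it is blocked.
src_mask = torch.triu(torch.full((S, S), float("-inf")), diagonal=1)

print(src_key_padding_mask)
print(src_mask)
```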

Hi @zhangguanheng66, @akashs, @LiHaibo

Can you please tell me what is the difference between the two sets of masks viz. ***_mask and ***_key_padding_mask?

From the documentation in the source code, this is what I could deduce. But I am not very confident and hence would really appreciate it if you can correct me:

  • src_mask, tgt_mask and memory_mask should be used when we want to apply the same mask to all the sequences in the given batch.
  • src_key_padding_mask, tgt_key_padding_mask and memory_key_padding_mask should be used when we want to specify different masks for different samples in the given batch. Also, the way you specify these masks is slightly different from the previous ones.

My question is: do both sets of masks achieve the same purpose? And should we be using either one of them?

For instance, if you want to create a Seq2Seq Transformer model with both TransformerEncoder and TransformerDecoder, is it OK if I only specify src_mask, tgt_mask and memory_mask?

@shahensha To your questions: key_padding_mask controls, for each batch item, which key positions may be attended to. It is most commonly used to avoid attending to padding elements. attn_mask controls which key positions each query position is allowed to attend to. This is useful for left-to-right (causal) attention, where we enforce that query positions may only attend to keys to their left (and at their own position).
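
One way to see the difference is to look at the attention weights that nn.MultiheadAttention returns. This is only a sketch with made-up sizes and random inputs. Both masks are given as booleans here, where True means "do not attend".

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
L, N, E = 4, 2, 8  # sequence length, batch size, embedding dim (arbitrary)
mha = nn.MultiheadAttention(embed_dim=E, num_heads=2)
x = torch.randn(L, N, E)  # self-attention input, shape (L, N, E)

# key_padding_mask: (N, S) bool, one row per batch item.
# Here only the last position of the first batch item is padding.
key_padding_mask = torch.tensor([[False, False, False, True],
                                 [False, False, False, False]])

# attn_mask: (L, S) bool, shared by the whole batch.
# True above the diagonal blocks attention to future positions.
attn_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)

# Returned weights are averaged over heads and have shape (N, L, S).
_, w_pad = mha(x, x, x, key_padding_mask=key_padding_mask)
_, w_causal = mha(x, x, x, attn_mask=attn_mask)

print(w_pad[0])     # last column is all zeros (padded key)
print(w_causal[0])  # upper triangle is all zeros (no look-ahead)
```

With key_padding_mask, a whole key column is zeroed for one batch item only. With attn_mask, the same pattern of (query, key) pairs is zeroed for every batch item.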

Thank you @zhangguanheng66

I finally understood all the different masks in the API. But for some reason, my system does not work well at inference time. The loss goes down nicely, but at inference it just produces garbage values.

I’m having a hard time understanding how to use nn.Transformer, too, even after reading through this thread, the tutorial, this github issue, and the example language model. My model seems to do nothing but copy the target sequence, no matter what I do.

The task is to predict the title of an article, given a sentence from the article. It’s just a test task for a similar task I would like to do. The sentence and the title are both of varying length. To facilitate batching, I use data loader collate_fn to pad every sentence in a batch to the length of the longest sentence in the batch. Same for title. While using nn.Transformer, I make the sentence the src, and the title the tgt.

I include a padding mask for both src and tgt, which has True values wherever I padded a sentence. I also include a tgt_mask generated by generate_square_subsequent_mask to make it so that the decoder can’t look ahead in a sequence while it’s predicting. Since the model was still copying everything, I also included a square mask for the src, but that didn’t help anything.

I feel that I’m missing something very obvious. Can anybody help?

Looping in @zhangguanheng66 who seems to know a lot about this.

For your first part, it seems that you are not setting up attn_mask correctly.

Wow, thanks for the quick reply.

Which attn_mask is that? Both source and target masks should be pretty standard

Here’s how I’m using it, where self.base is just a model that returns embeddings for inp (src) and tgt, and where src_mask and tgt_mask are the standard upper triangle matrices, and src/tgt_key_padding_mask are as I described previously:

# self.base returns embeddings for inputs (src) and targets (tgt)
inp_emb, tgt_emb = self.base(inputs, targets)
# Embeddings arrive as (N, S, E) and (N, T, E); nn.Transformer requires
# (S, N, E) and (T, N, E), so we transpose the first two dimensions
inp_emb = inp_emb.transpose(0, 1)
tgt_emb = tgt_emb.transpose(0, 1)

hdn = self.transformer(inp_emb, tgt_emb,
                       src_mask=src_mask, tgt_mask=tgt_mask,
                       src_key_padding_mask=inp_padding_mask,
                       tgt_key_padding_mask=tgt_padding_mask)

# Project to vocabulary scores, then go back to batch-first layout
out = self.head(hdn)
out = out.transpose(0, 1)

# Flatten predictions and targets so CrossEntropyLoss sees one row per token
loss_fct = nn.CrossEntropyLoss()
out_view = out.contiguous().view(-1, self.vocab_size)
tgt_view = targets.view(-1)
loss = loss_fct(out_view, tgt_view)

Could the transposes be throwing it off?

Well I was right, I was indeed missing something very obvious. To anyone who comes after me and has a similar problem, the reason why my network was only copying results was because my training strategy was wrong. I was passing in targets to the decoder and calculating loss based on how similar what it produced was to those targets. If you think about it, I was asking the decoder to behave like an auto-encoder, to reproduce exactly what I passed in. That’s not very difficult for a transformer decoder to do, so it learned to copy very quickly, even with masks. Doing this also makes it impossible to perform inference, since the decoder never learned how to generate anything new.

How, you might ask, do you fix this? The solution for me was a couple steps:

  1. To add special start and end tokens to every target; e.g. [ 'h', 'e', 'l', 'l', 'o'] became [ <start>, 'h', 'e', 'l', 'l', 'o', <end>] (since it’s a character model, my start and end tokens are actually unicode tokens)
  2. To add an additional loop in the training loop that starts with a target of length 1 and passes incrementally larger targets until it passes the entire target. Then calculate loss based on how similar the output is to the target shifted left by one. (I also do backpropagation each time – not sure if that’s correct or if they should be aggregated over the whole sub-loop.) E.g. [<start>] goes in, ['h'] is expected. Then [<start>, 'h'] goes in, ['h', 'e'] is expected. And so on. The last iteration is [<start>, 'h', 'e', 'l', 'l', 'o'], with ['h', 'e', 'l', 'l', 'o', <end>] expected. This particular way of training is called teacher forcing. It also sets us up nicely to perform inference.
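
The shift-by-one idea from the steps above can be sketched as follows. This is not the poster's code: the tiny model, the token ids (PAD=0, START=1, END=2) and all sizes are assumptions for illustration. It also uses a commonly used alternative to the growing-prefix loop: with a causal tgt_mask, position i of the decoder only sees target tokens up to i, so a single forward pass over the whole shifted target covers every prefix at once.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, E, PAD, START, END = 20, 16, 0, 1, 2  # hypothetical sizes and token ids

embed = nn.Embedding(VOCAB, E, padding_idx=PAD)
transformer = nn.Transformer(d_model=E, nhead=2, num_encoder_layers=1,
                             num_decoder_layers=1, dim_feedforward=32)
head = nn.Linear(E, VOCAB)  # projects decoder output to vocabulary scores

src = torch.randint(3, VOCAB, (7, 2))  # (S, N) source tokens
tgt = torch.tensor([[START, START],
                    [5, 9],
                    [6, 10],
                    [7, END],
                    [END, PAD]])       # (T, N) targets with <start>/<end>, padded

tgt_in = tgt[:-1]   # decoder input: everything except the last position
tgt_out = tgt[1:]   # expected output: the target shifted left by one

# Causal mask so that position i cannot look at positions after i.
T_in = tgt_in.size(0)
tgt_mask = torch.triu(torch.full((T_in, T_in), float("-inf")), diagonal=1)

hidden = transformer(embed(src), embed(tgt_in), tgt_mask=tgt_mask)  # (T-1, N, E)
logits = head(hidden)                                               # (T-1, N, VOCAB)

# Padded label positions are skipped through ignore_index.
loss_fct = nn.CrossEntropyLoss(ignore_index=PAD)
loss = loss_fct(logits.reshape(-1, VOCAB), tgt_out.reshape(-1))
loss.backward()
print(logits.shape, loss.item())
```

Because the decoder input and the labels are offset by one position, the model can no longer reach a low loss by copying its input.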

Inference (answering this issue now) then happens by simply passing the hidden state from the encoder (the memory argument of the decoder) and the [<start>] token to the decoder. Since the model has been trained to output a single token when a single <start> token is passed in, it should output (hopefully) the correct first token of our output sequence. Then, we can take that token, append it to our <start> token, and pass that in as input to the decoder. Now it should generate two tokens. We repeat this process until the <end> token is generated, and then we stop. This is known as greedy decoding. Both teacher forcing and greedy decoding are used with Google’s T5 (teacher forcing for training, greedy decoding for generation), so they’re viable today. There is, however, a method called beam search that often gets better results but takes longer to generate.
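
A rough sketch of that greedy loop, assuming the same kind of tiny hypothetical model (embedding, nn.Transformer, linear head) and made-up START/END ids. The model here is untrained, so the generated tokens are meaningless. Only the control flow is the point.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, E, START, END = 20, 16, 1, 2  # hypothetical sizes and token ids

embed = nn.Embedding(VOCAB, E)
transformer = nn.Transformer(d_model=E, nhead=2, num_encoder_layers=1,
                             num_decoder_layers=1, dim_feedforward=32)
head = nn.Linear(E, VOCAB)


def greedy_decode(src, max_len=10):
    """src: (S, 1) token indices for one sentence. Returns a list of token ids."""
    transformer.eval()  # disable dropout
    with torch.no_grad():
        memory = transformer.encoder(embed(src))  # encode the source once
        ys = torch.tensor([[START]])              # (1, 1): just <start>
        for _ in range(max_len):
            T = ys.size(0)
            tgt_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
            out = transformer.decoder(embed(ys), memory, tgt_mask=tgt_mask)
            # Only the last position predicts the next token.
            next_token = head(out[-1]).argmax(dim=-1)             # shape (1,)
            ys = torch.cat([ys, next_token.unsqueeze(0)], dim=0)  # (T + 1, 1)
            if next_token.item() == END:
                break
    return ys.squeeze(1).tolist()


generated = greedy_decode(torch.randint(3, VOCAB, (6, 1)))
print(generated)
```

The max_len cap is there because an untrained or poorly trained model may never emit <end>.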

If you switch to a transformer encoder with a triangular src_mask, you should be able to predict the next word, just like this example.

With just an encoder, wouldn’t the size of the output be limited to the size of src? That is, if I have a sentence the cow jumped over the moon (28 characters), then the maximum length of the predicted title is 28 characters. But with the encoder-decoder, the sentence can be any length and the output can be any length, which I want.

Thanks for your helpful comments here! I am very grateful to see someone who knows what to do. I would like to apply left to right causal attention so that I get a representation for each timepoint in my time series that I can use to make predictions. Do you know of any successful examples of applying this left to right causal attention?

For a transformer encoder, the output sequence has the same size as the input (a.k.a. src).
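
For the time-series question, here is a minimal sketch of an encoder with a left-to-right mask, assuming made-up sizes and random features in place of real data (and no positional encoding, which a real model would likely need). It checks two things: the output has the same length as the input, and changing future time steps leaves the representations of earlier time steps unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
S, N, E = 10, 3, 16  # time steps, batch size, features per step (hypothetical)

layer = nn.TransformerEncoderLayer(d_model=E, nhead=4, dim_feedforward=32)
encoder = nn.TransformerEncoder(layer, num_layers=2)
encoder.eval()  # turn dropout off so outputs are deterministic

# Causal mask: step t may attend to steps <= t only.
causal_mask = torch.triu(torch.full((S, S), float("-inf")), diagonal=1)

x = torch.randn(S, N, E)
out = encoder(x, mask=causal_mask)  # (S, N, E): one vector per time step

# Replace everything from step 6 onward and run again.
x_changed = x.clone()
x_changed[6:] = torch.randn(S - 6, N, E)
out_changed = encoder(x_changed, mask=causal_mask)

past_unchanged = torch.allclose(out[:6].detach(), out_changed[:6].detach(), atol=1e-5)
print(out.shape, past_unchanged)
```

Each out[t] can then be fed to a prediction head for step t + 1, since it depends only on steps up to t.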