nn.Transformer explanation

I am having a hard time making the new transformer work. The following code has unexpected (to me) output: the gradients for the model parameters are zero, so the optimizer step is of no use. The documentation for this module is not as explanatory as for others like RNN. If someone can explain to me how to make an encoder-decoder Transformer work, that would be great.

Code:

import torch
print(torch.__version__)
X = torch.tensor([[[95.0]], [[100.0]], [[105.0]], [[110.0]], [[115.0]]])
y = torch.tensor([[[120.0]]])
print(X.shape, y.shape)
print(X.requires_grad, y.requires_grad)
model = torch.nn.Transformer(d_model=1, nhead=1, dim_feedforward=100, dropout=0)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1)
parms = [j for j in model.parameters()][:3]
model.train()
optimizer.zero_grad()
y_pred = model(X,y)
print(y_pred)
print(y)
print(y_pred.requires_grad)
print(y_pred._grad)
loss = criterion(y_pred, y)
print(loss)
for i in parms: print(i._grad)
loss.backward()
print(y_pred._grad)
for i in parms: print(i._grad)

Output:

1.2.0
torch.Size([5, 1, 1]) torch.Size([1, 1, 1])
False False
tensor([[[-5.7748e-11]]], grad_fn=<NativeLayerNormBackward>)
tensor([[[120.]]])
True
None
tensor(14400., grad_fn=<MseLossBackward>)
None
None
None
None
tensor([[0.],
        [0.],
        [0.]])
tensor([0., 0., 0.])
tensor([[0.]])

Expected output: non-zero gradients for the model parameters.
I am trying to train an encoder-decoder to complete a sequence of numbers, for example:
input = 95, 100, 105, 110, 115 and the corresponding output = 120.
I don't know much about transformers, which is why I tried this.
Also, the example on the documentation site does not make sense to me:

>>> transformer_model = nn.Transformer(src_vocab, tgt_vocab)
>>> transformer_model = nn.Transformer(src_vocab, tgt_vocab, nhead=16, num_encoder_layers=12)

What are src_vocab and tgt_vocab? I know what they mean, but which parameters of the nn.Transformer constructor do they correspond to?


I'm also having a problem here. Could someone kind and smart please explain this?

'What are src_vocab and tgt_vocab? I know what they mean, but which parameters of the nn.Transformer constructor do they correspond to?'

I'm having the same problem, but for the example part I guess it is a mistake on their side.
nn.Transformer doesn't take source and target vocabulary sizes, because it only implements the transformer itself, without the embedding layer on the input data and without the linear layer on the output of the decoder.
To make it work, d_model will be your embedding size: call an embedding layer on the source and on the target, and pass the output of the transformer through a linear layer that maps to the target vocabulary size, as in the snippet below.

self.embed_src = nn.Embedding(src_vocab, emb_dim)
self.embed_trg = nn.Embedding(trg_vocab, emb_dim)
self.model = nn.Transformer(d_model=emb_dim, nhead=heads,
                            num_encoder_layers=N, num_decoder_layers=N)
self.out = nn.Linear(emb_dim, trg_vocab)

For the forward function it should be:

src = self.embed_src(src) 
trg = self.embed_trg(trg)
output = self.model(src, trg)
output = self.out(output)
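
Putting those snippets together, a minimal wrapper could look like the sketch below. The class name Seq2SeqTransformer and the default values for emb_dim, heads, and N are placeholders of mine, and positional encoding is left out for brevity:

import torch
import torch.nn as nn

class Seq2SeqTransformer(nn.Module):  # hypothetical name, not from the snippets above
    def __init__(self, src_vocab, trg_vocab, emb_dim=512, heads=8, N=6):
        super().__init__()
        self.embed_src = nn.Embedding(src_vocab, emb_dim)
        self.embed_trg = nn.Embedding(trg_vocab, emb_dim)
        self.model = nn.Transformer(d_model=emb_dim, nhead=heads,
                                    num_encoder_layers=N, num_decoder_layers=N)
        self.out = nn.Linear(emb_dim, trg_vocab)

    def forward(self, src, trg):
        # src: (S, N) and trg: (T, N) token indices
        src = self.embed_src(src)       # (S, N, emb_dim)
        trg = self.embed_trg(trg)       # (T, N, emb_dim)
        output = self.model(src, trg)   # (T, N, emb_dim)
        return self.out(output)         # (T, N, trg_vocab)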

There is a typo in the docs and a PR is going to fix it. You don't need src_vocab and tgt_vocab to initialize the transformer module. Here is a simple example:

import torch
import torch.nn as nn

transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)  # d_model defaults to 512
src = torch.rand((10, 32, 512))    # (S, N, E)
tgt = torch.rand((20, 32, 512))    # (T, N, E)
out = transformer_model(src, tgt)  # (T, N, E)

A full example applying nn.Transformer to the word language model can be found here. The example implements both RNN and transformer models.


Can someone explain the src and src_mask shapes for the transformer?
For example, I have a tokenized text sentence with max_len=128.
This sentence goes through nn.Embedding(src_vocab=5000, emb_dim=128).
The output of the embedding will be a tensor with shape (N, 128, 128), where N=batch_size.
The transformer docs say that the src input and src_mask have shapes:
src: (S, N, E) and src_mask: (S, S),
where S is the source sequence length, N is the batch size, and E is the number of features.
Should I make some changes to the embedding output before using it as input to the transformer layer?
I'm a bit confused :confused: .


S is the sequence length; N is the batch size; E is the number of features (a.k.a. the embedding dimension in your case).

If you send token indices of shape (S, N) to the embedding layer, the output will be of shape (S, N, 128). Then you don't need to make any changes in order to feed it to the transformer layer. The src_mask is just a square matrix which is used to filter the attention weights.

See example here
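
If it helps, here is a small sketch of those shapes using the numbers from the question above (the vocab size and dimensions are just the ones mentioned there):

import torch
import torch.nn as nn

S, N, E = 128, 32, 128                     # sequence length, batch size, embedding dim
embedding = nn.Embedding(5000, E)          # vocab size 5000

tokens = torch.randint(0, 5000, (S, N))    # token indices, shape (S, N)
src = embedding(tokens)                    # embedded input, shape (S, N, E)
src_mask = torch.zeros(S, S)               # (S, S) additive mask; 0.0 = attend, -inf = block

print(src.shape, src_mask.shape)           # torch.Size([128, 32, 128]) torch.Size([128, 128])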


Thanks for your reply!!
I'm a bit confused by the embedding layer output. I'll try to explain:

My sentences have size torch.Size([128]).
So, if I'm using a batch size of 32, the tensor will have size
torch.Size([32, 128]) -> shape = (N, S).
When I send this tensor to the embedding layer (with src_vocab=5000 and emb_dim=128), the output will have size
torch.Size([32, 128, 128]) -> shape = (N, S, E).
This is confusing me: should I permute the first and second dimensions to get shape = (S, N, E)?


Yes, you should transpose your input after the embedding layer.

For nn.Transformer, we chose the shape to be (S, N, E), while some NLP people use (N, S, E). There is nothing right or wrong about either, and switching between the two shapes is fine.
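
For example, something like this (the sizes just mirror the ones discussed above):

import torch

N, S, E = 32, 128, 128
batch_first = torch.rand(N, S, E)   # (N, S, E), as produced by a batch-first embedding
src = batch_first.transpose(0, 1)   # -> (S, N, E), the layout nn.Transformer expects
print(src.shape)                    # torch.Size([128, 32, 128])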


Thank you!!! :smiley:


Hi, I'm a bit confused about src_mask and src_key_padding_mask. The explanations in the PyTorch docs are:
src_mask – the additive mask for the src sequence (optional).
src_key_padding_mask – the ByteTensor mask for src keys per batch (optional).
In my opinion, src_mask's dimension is (S, S), where S is the max source length in the batch, so I would need to send src_mask of shape (N, S, S) to the Transformer. I don't know if I understand that correctly. I also don't understand the src_key_padding_mask explanation in the docs; it is confusing me.
For the provided example code,
output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
the [src/tgt/memory]_key_padding_mask arguments are left as None by default, and I'm a little confused about that operation as well.

@LiHaibo
First, both masks work on the dot product of query and key in the “Scaled Dot-Product Attention” layer.
src_mask operates on a matrix with a dimension of (S, S) and adds '-inf' at individual positions. src_key_padding_mask is more like a padding marker: it masks specific tokens in the src sequence (i.e., the entire column/row of the attention matrix for that token is set to '-inf').

@zhangguanheng66 Thanks for the explanation.
Just to check whether I understand correctly: we should provide the sequence padding mask in src_key_padding_mask, and its dimension would be (N, S), where N is the batch size and S is the sequence length. I'm still unsure what the content of src_key_padding_mask should be: will it be a -inf/0 matrix or a boolean matrix with True/False?

The padding mask is (N, S) with boolean True/False values. src_mask is (S, S) with float('-inf') and float(0.0). There is a note about this in the PyTorch nn.Transformer docs.
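
For what it's worth, here is a small sketch of both masks together. The sizes, the use of generate_square_subsequent_mask just to show the (-inf / 0.0) format, and the pretend padding positions are illustrative choices of mine, and it assumes a PyTorch version that accepts bool padding masks:

import torch
import torch.nn as nn

S, N, E = 10, 32, 512
model = nn.Transformer(d_model=E, nhead=8)

src = torch.rand(S, N, E)
tgt = torch.rand(S, N, E)

# (S, S) float mask: 0.0 where attention is allowed, -inf where it is blocked
src_mask = model.generate_square_subsequent_mask(S)

# (N, S) boolean mask: True marks padding positions to be ignored
src_key_padding_mask = torch.zeros(N, S, dtype=torch.bool)
src_key_padding_mask[:, -2:] = True   # pretend the last two tokens of each sequence are padding

out = model(src, tgt, src_mask=src_mask,
            src_key_padding_mask=src_key_padding_mask)
print(out.shape)                      # torch.Size([10, 32, 512])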