nn.Transformer explanation

I am having a hard time making the new Transformer module work. The following code gives output I did not expect: the gradients of the model parameters are all zero, so the optimizer step has no effect. The documentation for this module is not as detailed as for other modules such as RNN. If someone could explain how to make an encoder-decoder Transformer work, that would be great.


import torch
X = torch.tensor([[[95.0]], [[100.0]], [[105.0]], [[110.0]], [[115.0]]])
y = torch.tensor([[[120.0]]])
print(X.shape, y.shape)
print(X.requires_grad, y.requires_grad)
model = torch.nn.Transformer(d_model=1, nhead=1, dim_feedforward=100, dropout=0)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1)
parms = [j for j in model.parameters()][:3]  # first three parameter tensors
y_pred = model(X, y)
print(y_pred)
loss = criterion(y_pred, y)
print(loss)
loss.backward()  # populate .grad before inspecting it
for i in parms: print(i.grad)


torch.Size([5, 1, 1]) torch.Size([1, 1, 1])
False False
tensor([[[-5.7748e-11]]], grad_fn=<NativeLayerNormBackward>)
tensor(14400., grad_fn=<MseLossBackward>)
tensor([0., 0., 0.])

Expected output: non-zero gradients for the model parameters.
I am trying to train an encoder-decoder to complete a sequence of numbers, for example
input = 95, 100, 105, 110, 115 with the corresponding output = 120.
I don’t know much about Transformers, which is why I tried to build this.
Also, the example on the documentation site does not make sense to me:

>>> transformer_model = nn.Transformer(src_vocab, tgt_vocab)
>>> transformer_model = nn.Transformer(src_vocab, tgt_vocab, nhead=16, num_encoder_layers=12)

What are src_vocab and tgt_vocab? I know what they mean, but which parameters of the nn.Transformer constructor do they correspond to?

I’m also having this problem. Could someone kind and knowledgeable explain this, please?

‘What are src_vocab and tgt_vocab? I know what they mean, but which parameters of the nn.Transformer constructor do they correspond to?’

I’m having the same problem, but for the example part I guess it is a mistake on their side.
nn.Transformer doesn’t take source and target vocab sizes, because it only implements the transformer itself, without the embedding layers on the input data and without the linear layer on the output of the decoder.
To make it work, d_model will be your embedding size: call an embedding layer on the source and on the target, and pass the output of the transformer through a linear layer that maps it to the target vocab size.

self.embed_src = nn.Embedding(src_vocab, emb_dim)
self.embed_trg = nn.Embedding(trg_vocab, emb_dim)
self.model = nn.Transformer(d_model=emb_dim, nhead=heads, num_encoder_layers=N, num_decoder_layers=N)
self.out_linear = nn.Linear(emb_dim, trg_vocab)

For the forward function it should be:

src = self.embed_src(src) 
trg = self.embed_trg(trg)
output = self.model(src, trg)
output = self.out_linear(output)
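Putting the two fragments above together, a minimal self-contained sketch could look like the following. The class name `Seq2SeqTransformer`, the vocab sizes, and the hyperparameter defaults are made up for illustration and are not from the original post. Positional encodings and attention masks are left out to keep it short, so this only shows how the pieces connect and what shapes come out; it is not a complete training setup.

```python
import torch
import torch.nn as nn


class Seq2SeqTransformer(nn.Module):
    """Hypothetical wrapper: embeddings -> nn.Transformer -> linear projection."""

    def __init__(self, src_vocab, trg_vocab, emb_dim=32, heads=4, N=2):
        super().__init__()
        self.embed_src = nn.Embedding(src_vocab, emb_dim)
        self.embed_trg = nn.Embedding(trg_vocab, emb_dim)
        # d_model must equal the embedding size and be divisible by nhead.
        self.model = nn.Transformer(
            d_model=emb_dim,
            nhead=heads,
            num_encoder_layers=N,
            num_decoder_layers=N,
            dim_feedforward=64,  # illustrative value, kept small
            dropout=0.0,
        )
        self.out_linear = nn.Linear(emb_dim, trg_vocab)

    def forward(self, src, trg):
        # src: (S, batch) token ids, trg: (T, batch) token ids
        src = self.embed_src(src)      # (S, batch, emb_dim)
        trg = self.embed_trg(trg)      # (T, batch, emb_dim)
        output = self.model(src, trg)  # (T, batch, emb_dim)
        return self.out_linear(output)  # (T, batch, trg_vocab)


torch.manual_seed(0)
net = Seq2SeqTransformer(src_vocab=50, trg_vocab=60)
src = torch.randint(0, 50, (10, 3))  # source length 10, batch of 3
trg = torch.randint(0, 60, (7, 3))   # target length 7, batch of 3
logits = net(src, trg)
print(logits.shape)  # torch.Size([7, 3, 60])
```

The output has one score per target-vocab entry for every target position, which is what a loss such as nn.CrossEntropyLoss would consume after flattening the first two dimensions.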

There is a typo in the docs, and the PR is going to fix it. You don’t need src_vocab and tgt_vocab to instantiate the Transformer module. Here is a simple example:

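import torch
import torch.nn as nn
# d_model defaults to 512, which must be divisible by nhead
transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
src = torch.rand((10, 32, 512))  # (source length, batch size, d_model)
tgt = torch.rand((20, 32, 512))  # (target length, batch size, d_model)
out = transformer_model(src, tgt)  # same shape as tgt: (20, 32, 512)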

A full example applying nn.Transformer to a word language model can be found here. The example uses both RNN and Transformer modules.
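On the zero gradients in the original post: nobody in this thread diagnoses them, so take this as a likely explanation rather than a confirmed one. The model was built with d_model=1, and nn.Transformer applies LayerNorm over the last (d_model) dimension. With a single feature, the mean equals the value itself, so the normalized result is zero whatever the input, and no gradient flows back through the norm. A small sketch of that effect on a standalone nn.LayerNorm (the input values are just taken from the question):

```python
import torch
import torch.nn as nn

# LayerNorm over a single feature, as happens inside nn.Transformer when d_model=1.
norm = nn.LayerNorm(1)
x = torch.tensor([[95.0], [100.0], [105.0]], requires_grad=True)

out = norm(x)         # each row is normalized over its one feature
out.sum().backward()  # backpropagate a simple scalar through the norm

print(out.detach().flatten())  # (approximately) all zeros, independent of x
print(x.grad.flatten())        # (approximately) all zeros: nothing flows back
```

If that is what is happening, projecting each scalar to a larger d_model first (for example with an nn.Linear(1, d_model) layer, which is an assumption on my part and not something from the thread) should give the LayerNorms something to normalize.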