I am having a hard time getting the new `nn.Transformer` module to work. The following code produces output I did not expect: the gradients of the model parameters are all zero, so the optimizer step has no effect. The documentation for this module is not as thorough as for other modules such as RNN. If someone could explain how to make an encoder-decoder Transformer work, that would be great.

**code**

```
import torch
print(torch.__version__)
X = torch.tensor([[[95.0]], [[100.0]], [[105.0]], [[110.0]], [[115.0]]])
y = torch.tensor([[[120.0]]])
print(X.shape, y.shape)
print(X.requires_grad, y.requires_grad)
model = torch.nn.Transformer(d_model=1, nhead=1, dim_feedforward=100, dropout=0)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1)
parms = [j for j in model.parameters()][:3]
model.train()
optimizer.zero_grad()
y_pred = model(X,y)
print(y_pred)
print(y)
print(y_pred.requires_grad)
print(y_pred._grad)
loss = criterion(y_pred, y)
print(loss)
for i in parms: print(i._grad)
loss.backward()
print(y_pred._grad)
for i in parms: print(i._grad)
```

**output**

```
1.2.0
torch.Size([5, 1, 1]) torch.Size([1, 1, 1])
False False
tensor([[[-5.7748e-11]]], grad_fn=<NativeLayerNormBackward>)
tensor([[[120.]]])
True
None
tensor(14400., grad_fn=<MseLossBackward>)
None
None
None
None
tensor([[0.],
[0.],
[0.]])
tensor([0., 0., 0.])
tensor([[0.]])
```

**expected output**: non-zero gradients for the model parameters

I am trying to train an encoder-decoder to complete a sequence of numbers, for example:

input = 95, 100, 105, 110, 115, with the corresponding output = 120.
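A minimal sketch of how this task might be set up with a wider model, under several assumptions that are not in the original script: the scalars are projected to `d_model=16` by a hypothetical `embed` layer, a hypothetical `head` layer maps back to one value, the decoder input `tgt_in` is simply the last known value, and no positional encoding or input scaling is applied. It only checks that gradients reach the Transformer parameters. It is not a tuned training setup.

```python
import torch

torch.manual_seed(0)
d_model = 16  # assumed width; any value > 1 divisible by nhead should work

# Shapes follow the default (sequence, batch, feature) layout.
src = torch.tensor([[[95.0]], [[100.0]], [[105.0]], [[110.0]], [[115.0]]])
tgt_in = torch.tensor([[[115.0]]])   # hypothetical decoder input: last known value
target = torch.tensor([[[120.0]]])

embed = torch.nn.Linear(1, d_model)  # hypothetical scalar -> d_model projection
model = torch.nn.Transformer(
    d_model=d_model, nhead=2,
    num_encoder_layers=1, num_decoder_layers=1,
    dim_feedforward=32, dropout=0.0,
)
head = torch.nn.Linear(d_model, 1)   # hypothetical d_model -> scalar projection

out = head(model(embed(src), embed(tgt_in)))
loss = torch.nn.functional.mse_loss(out, target)
loss.backward()

# Sum of absolute gradients over all Transformer parameters.
grad_norm = sum(p.grad.abs().sum() for p in model.parameters() if p.grad is not None)
print(out.shape, loss.item(), grad_norm.item())
```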

I don’t know much about transformers, which is why I tried to build this small example.

Also, the example on the documentation site does not make sense to me:

```
>>> transformer_model = nn.Transformer(src_vocab, tgt_vocab)
>>> transformer_model = nn.Transformer(src_vocab, tgt_vocab, nhead=16, num_encoder_layers=12)
```

What are `src_vocab` and `tgt_vocab`? I know what they mean, but which parameters of the `nn.Transformer` constructor do they correspond to?
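One way to check which names the constructor accepts is to inspect its signature directly. As far as I can tell, the first two positional parameters are `d_model` and `nhead`, and no parameter is named after a vocabulary, so passing vocabulary sizes positionally would set those two values instead. The variable name `params` is only for illustration.

```python
import inspect
import torch

# List the constructor's parameter names in positional order.
params = list(inspect.signature(torch.nn.Transformer.__init__).parameters)
print(params)

# Check whether the names from the documentation example are accepted.
print("src_vocab" in params, "tgt_vocab" in params)
```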