Hello, I used multi-head attention in my network:

```
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttenion1Like(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # embed_dim=35, num_heads=5, dropout=0.3
        self.multiHead = nn.MultiheadAttention(35, 5, 0.3)
        self.fc1 = nn.Linear(in_features=3500, out_features=2000)
        self.dropout1 = nn.Dropout(0.5)
        self.fc2 = nn.Linear(in_features=2000, out_features=12)
        self.swish = Swish()  # custom activation, defined elsewhere

    def forward(self, x):
        # (Batch, Timesteps, Features) -> (Timesteps, Batch, Features)
        x = x.transpose(0, 1).contiguous()
        x, y = self.multiHead(x, x, x)
        # flatten into rows of 100*35 values
        x = x.view(-1, 100 * 35)
        x = self.swish(self.fc1(x))  # also used relu, the same problem
        x = self.dropout1(x)
        x = F.softmax(self.fc2(x), dim=1)
        return x
```

My input has the shape (Batch, Timesteps, Features). For example:

```
input = torch.Tensor(np.random.random(size=(5, 100, 35)))
```
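To make the shapes concrete, here is a small sketch (not part of the model) that traces a tensor of this shape through the same `transpose` and `view` calls used in `forward`. The values are an assumption made purely for illustration: every element stores the index of the batch item it belongs to. It suggests that, after the transpose, each row produced by `view(-1, 100*35)` may contain values from several batch items, so it might be worth transposing back to batch-first before flattening.

```python
import torch

B, T, F_DIM = 5, 100, 35  # batch, timesteps, features (from the example input)

# Illustrative data: every element stores the index of its batch item.
x = torch.arange(B, dtype=torch.float32).view(B, 1, 1).expand(B, T, F_DIM)

# Same step as in forward(): (B, T, F) -> (T, B, F)
seq_first = x.transpose(0, 1).contiguous()

# Flatten without transposing back, as in forward().
flat_mixed = seq_first.view(-1, T * F_DIM)

# Alternative: restore (B, T, F) before flattening.
flat_per_sample = seq_first.transpose(0, 1).contiguous().view(-1, T * F_DIM)

# Count rows that contain more than one batch index, and rows with exactly one.
mixed_rows = sum(row.unique().numel() > 1 for row in flat_mixed)
clean_rows = sum(row.unique().numel() == 1 for row in flat_per_sample)
print(flat_mixed.shape, "rows mixing batch items:", mixed_rows)
print(flat_per_sample.shape, "rows with a single batch item:", clean_rows)
```

If the first count is non-zero, the rows fed to `fc1` no longer line up one-to-one with the targets.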

With the same input, a simple RNN works and actually learns, while this model does not learn. The loss and optimizer are:

```
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr = lr)
```
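One thing that may be relevant here: `nn.CrossEntropyLoss` applies log-softmax internally and expects raw, unnormalized scores. The sketch below uses made-up logits and a made-up target (assumptions for illustration only) to compare the loss on raw logits with the loss when `F.softmax` has already been applied, as in `forward` above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.CrossEntropyLoss()

# Hypothetical scores for one sample over 12 classes, strongly favouring class 3.
logits = torch.zeros(1, 12)
logits[0, 3] = 10.0
target = torch.tensor([3])  # hypothetical correct class

# Loss when the criterion receives raw logits.
loss_on_logits = criterion(logits, target)

# Loss when the criterion receives already-normalized probabilities.
loss_on_probs = criterion(F.softmax(logits, dim=1), target)

print("loss on logits:", loss_on_logits.item())
print("loss on softmax output:", loss_on_probs.item())
```

With the extra softmax, the inputs to the criterion are confined to [0, 1], so the loss cannot get close to zero even for a confident correct prediction, which may slow or stall learning.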

What is the problem? Why does the model not learn?

Thanks in advance!