Learning problem

Hello, I used multi-head attention in my network:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class Swish(nn.Module):
    # standard Swish activation: x * sigmoid(x)
    def forward(self, x):
        return x * torch.sigmoid(x)

class SelfAttenion1Like(nn.Module):
    def __init__(self, num_classes):
        super(SelfAttenion1Like, self).__init__()
        # embed_dim=35, num_heads=5, dropout=0.3
        self.multiHead = nn.MultiheadAttention(35, 5, 0.3)
        self.fc1 = nn.Linear(in_features=3500, out_features=2000)
        self.dropout1 = nn.Dropout(0.5)
        self.fc2 = nn.Linear(in_features=2000, out_features=12)
        self.swish = Swish()

    def forward(self, x):
        # (batch, timesteps, features) -> (timesteps, batch, features)
        x = x.transpose(0, 1).contiguous()
        x, attn_weights = self.multiHead(x, x, x)
        # x is still (timesteps, batch, features) here; flatten to (-1, 100*35)
        x = x.view(-1, 100 * 35)
        x = self.swish(self.fc1(x))  # also tried ReLU, same problem
        x = self.dropout1(x)
        x = F.softmax(self.fc2(x), dim=1)
        return x

My input has shape (Batch, Timesteps, Features). For example:

input = torch.Tensor(np.random.random(size=(5, 100, 35)))
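
For reference, a forward pass runs fine and produces an output of shape (batch, num_classes); quick check, using the Swish definition above (my assumed implementation):

model = SelfAttenion1Like(num_classes=12)
x = torch.rand(5, 100, 35)  # (batch, timesteps, features)
out = model(x)
print(out.shape)  # torch.Size([5, 12])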

A simple RNN on the same data actually learns, while this model does not.

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr = lr)
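
The training step is the usual one (simplified sketch; num_epochs and train_loader are placeholders for my actual setup):

for epoch in range(num_epochs):
    for data, target in train_loader:  # target: class indices, shape (batch,)
        optimizer.zero_grad()
        output = model(data)           # (batch, 12)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()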

What is the problem? Why does the model not learn?
Thanks in advance!

@aldeka12 Remove the F.softmax at the end of the network. nn.CrossEntropyLoss applies log_softmax internally, so the model should return the raw logits.
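
That is, the end of forward would look like this (sketch of the relevant lines only):

x = self.swish(self.fc1(x))
x = self.dropout1(x)
x = self.fc2(x)  # raw logits, no softmax
return x

If you need probabilities at inference time, apply F.softmax(output, dim=1) outside the loss computation.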

Hello, thanks for mentioning this, I did not know that! But that is still not the problem; probably my dataset is just not well suited for this kind of model.

Alright.
I am afraid I won't be able to help you further here, as I am not that familiar with RNNs.