Same output for entire batch

import torch
import torch.nn as nn

class FinalDecoder(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(FinalDecoder, self).__init__()

        self.hidden_size = hidden_size

        # the GRU expects seq-first input: (n_max_chars, n_words, input_size)
        self.gru = nn.GRU(input_size=input_size, hidden_size=hidden_size, batch_first=False)
        self.fc_0 = nn.Linear(in_features=hidden_size, out_features=output_size, bias=True)
        self.relu = nn.ReLU()
        self.log_softmax = nn.LogSoftmax(dim=2)

    def forward(self, x, hidden):
        output, hidden = self.gru(x, hidden)
        output = output.transpose(1, 0)

        # use the final hidden state as the per-word representation
        # (note: the transposed GRU output above is overwritten here)
        output = hidden

        output = self.fc_0(output)
        output = self.relu(output)
        output = self.log_softmax(output)

        return output, hidden

    def init_hidden(self, n_words):
        # single layer, single direction -> (1, n_words, hidden_size)
        return torch.zeros(1, n_words, self.hidden_size)

Input
A ‘word tensor’ of shape (n_max_chars, n_words, 29), which is actually the padded output of my previous model.

hidden = self.decoder.init_hidden(n_words=word_tensors.shape[1]).to(self.device)
output, hidden = self.decoder(x=word_tensors, hidden=hidden)
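
For concreteness, a minimal self-contained run with dummy data reproduces the call above (all sizes here are placeholders, not my real ones):

import torch

n_max_chars, n_words, vocab_size = 12, 100, 10000                 # placeholder sizes
decoder = FinalDecoder(input_size=29, hidden_size=256, output_size=vocab_size)

word_tensors = torch.rand(n_max_chars, n_words, 29)               # stands in for the padded char probabilities
hidden = decoder.init_hidden(n_words=word_tensors.shape[1])       # (1, n_words, hidden_size)
output, hidden = decoder(x=word_tensors, hidden=hidden)           # one log-probability vector per word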

Output
Another ‘word tensor’ of shape (n_words, 1, |vocabulary|), from which the predicted word index is supposed to be read off for each char probability sequence given as input.
The predictions tend to converge to a single value, except for the first few entries, which somehow almost always take a different value.

Predicted indices: tensor([[9895],
        [9895],
        [9895],
        [8191],
        [8191],
        [8191],
        ...
        [8191]], device='cuda:0')
(output truncated here; every remaining entry is 8191)
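
For context, the indices above are just the greedy pick of the most likely vocabulary entry for each word; the line producing them is essentially this (paraphrased, not my exact code):

predicted_indices = output.argmax(dim=2).reshape(-1, 1)   # index of the highest log-probability per word, as a column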

Additional Information
As n_words basically represents the batch size, the batch size is pretty big (around 100). I therefore tend to use a slightly higher learning rate. However, since this looks like a ‘mean prediction’, I have also tried multiple other values, but even a very small one (e.g. 1.0e-5) or an even bigger one produced the same result.
I use the Adam optimizer.
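Concretely, the optimizer setup is just the following (the learning rate shown is only one of the values I swept):

import torch

decoder = FinalDecoder(input_size=29, hidden_size=256, output_size=10000)   # placeholder sizes
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)                 # lr is illustrative; I tried several values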
I’ve tried different loss functions. First I tried CTCLoss, since it handles variable lengths and therefore seemed perfect for my use case, where predictions and targets almost never have the same length (at least at the beginning). I have some concerns about this, as it is meant for a kind of word-wise classification and my predictions won’t ever contain the blank label, but in my understanding of CTCLoss this should not be a problem. However, I have also tried CrossEntropy and LogSigmoid + NLLLoss.
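
For reference, the shape contract I’m relying on for CTCLoss is roughly the following (all sizes, lengths and target values below are dummies):

import torch
import torch.nn as nn

T, N, C = 20, 100, 10000                                     # time steps, batch size (n_words), vocab size
log_probs = torch.randn(T, N, C).log_softmax(dim=2)          # (T, N, C) log-probabilities
targets = torch.randint(1, C, (N, 5), dtype=torch.long)      # dummy targets, 5 labels each, no blank (index 0)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 5, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)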
Thanks in advance for any help.

EDIT: the loss decreases while the output stays the same. Looks like the model likes to cheat.

[Padding (Nope)
After looking at some more logs, I’m pretty convinced that the problem is somehow caused by the padding. Padding is done with pad_sequence, which pads the different ‘word tensors’ together so that they all have the same ‘char probability length’.]
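
For reference, the padding mentioned in the note above is essentially this (tensor contents are dummies):

import torch
from torch.nn.utils.rnn import pad_sequence

# each entry is one word's char probability sequence of shape (n_chars_i, 29)
word_tensor_list = [torch.rand(5, 29), torch.rand(8, 29), torch.rand(3, 29)]

# pad to the longest sequence -> (n_max_chars, n_words, 29); padded positions are zeros
word_tensors = pad_sequence(word_tensor_list, batch_first=False, padding_value=0.0)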

Regards,
Unity05