What does it mean when a model is unable to overfit a small batch of examples, even though the model is not small?

I am trying to train an RNN model to classify sentences into 4 classes, but it doesn't seem to work. I tried to overfit 4 examples (blue line), which worked, but even as few as 8 examples (red line) won't overfit, let alone the whole dataset.


I tried different learning rates and values of hidden_size, but it doesn't seem to help. What am I missing? I know that if a model is not able to overfit a small batch, its capacity should usually be increased, but in this case increasing the capacity has no effect.
The architecture is as follows:

import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size=1, hidden_size=256, num_classes=4):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        #x=[batch_size, sequence_length]
        x = x.unsqueeze(-1)  #x=[batch_size, sequence_length, 1] 
        _, h_n = self.rnn(x)  #h_n=[1, batch_size, hidden_size]
        h_n = h_n.squeeze(0)
        out = self.fc(h_n)  #out=[batch_size, num_classes]
        return out
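
For reference, the overfitting test looks roughly like this (a minimal sketch, not my exact setup: the random tensors stand in for my encoded sentences, and the optimizer and hyperparameters are assumptions):

import torch
import torch.nn as nn

model = RNN(input_size=1, hidden_size=256, num_classes=4)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# 8 random sequences of length 50 standing in for the encoded sentences
x = torch.randn(8, 50)           # [batch_size, sequence_length]
y = torch.randint(0, 4, (8,))    # labels in {0, 1, 2, 3}

for step in range(1000):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"step {step}: loss = {loss.item():.4f}")

On 8 examples I would expect the loss to go to nearly zero, but it plateaus instead.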

The issue was due to vanishing gradients; it got resolved by using batch normalization.

Could that be the problem here as well?
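
To check the vanishing-gradient hypothesis directly, here is a minimal sketch that prints per-parameter gradient norms after one backward pass (reusing the RNN class above; the random inputs are placeholders):

import torch
import torch.nn as nn

model = RNN(input_size=1, hidden_size=256, num_classes=4)
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 50)           # [batch_size, sequence_length]
y = torch.randint(0, 4, (8,))

loss = criterion(model(x), y)
loss.backward()

# Norms near zero on the recurrent weights (rnn.weight_hh_l0) would point
# to vanishing gradients; a plain tanh nn.RNN is prone to this on longer
# sequences, which is why LSTM/GRU are common replacements.
for name, param in model.named_parameters():
    grad_norm = param.grad.norm().item() if param.grad is not None else 0.0
    print(f"{name}: grad norm = {grad_norm:.6f}")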