LSTM/GRU/RNN prefers short sequences

I am training an LSTM to predict a scalar value between 0 and 1 from a sentence given as a sequence of tokens.

The model consists of the standard LSTM implementation (I also tried GRU and vanilla RNN) and uses word embeddings via an Embedding module.

The LSTM produces a sentence representation (the last hidden state), which a linear layer projects down to a single dimension; a sigmoid then squashes the value between 0 and 1.
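For reference, here is a minimal sketch of the setup described above, assuming PyTorch (the class name, layer sizes, and `batch_first` layout are my assumptions, not from my actual code):

```python
import torch
import torch.nn as nn

class SentenceScorer(nn.Module):
    """Hypothetical sketch: Embedding -> LSTM -> last hidden state
    -> Linear -> sigmoid. Sizes are illustrative assumptions."""
    def __init__(self, vocab_size, embed_dim=50, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, tokens):
        # tokens: (batch, seq_len) tensor of token ids
        embedded = self.embedding(tokens)        # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)        # h_n: (1, batch, hidden_dim)
        # project the last hidden state to one dimension, squash to (0, 1)
        return torch.sigmoid(self.fc(h_n[-1]))  # (batch, 1)
```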

This works perfectly fine for an easy case: an artificial dataset where each sentence of random length consists entirely of one of two different words, with one word appearing only in sentences with high scores and the other only in sentences with low scores.
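The artificial dataset could be generated roughly like this (the specific words, score values, and length range are assumptions for illustration):

```python
import random

def make_toy_dataset(n_sentences, max_len=20, hi_word="good", lo_word="bad"):
    """Toy data: each sentence of random length repeats a single word;
    one word appears only with high scores, the other only with low scores.
    Word choices and score values are illustrative assumptions."""
    data = []
    for _ in range(n_sentences):
        length = random.randint(1, max_len)
        if random.random() < 0.5:
            data.append(([hi_word] * length, 0.9))  # high-score sentence
        else:
            data.append(([lo_word] * length, 0.1))  # low-score sentence
    return data
```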

When I apply this model to my real dataset, however, I observe that although the loss decreases steadily, the trained model assigns the highest scores to very short sentences (~5-10 words) during testing. This is strange, not only because it leads to bad evaluation results, but also because short sentences are usually labeled with lower scores than longer (~30-word) sentences.

I am not very experienced with RNNs yet and haven't encountered a problem like this before. Can anybody think of a reason why this happens?