LSTM final state


I have trained an LSTM network (on embedding, language-modeling, and classification tasks). At inference time I feed in an input text and read the last hidden state of the network as the representation of the input sequence. My forward method looks as follows:

    def forward(self, input, hidden):
        """Forward pass: return the final hidden state as the sequence representation."""
        emb = self.word_embeddings(input)
        _, (hT, _) = self.rnn(emb, hidden)
        return hT
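For context, here is a minimal standalone sketch of what that forward pass returns, with hypothetical dimensions (the original model's sizes are not given). Note that `hT` has shape `(num_layers * num_directions, batch, hidden_dim)`, so `hT[-1]` is the final hidden state of the last layer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes for illustration only
emb_dim, hidden_dim, num_layers = 8, 16, 1
rnn = nn.LSTM(emb_dim, hidden_dim, num_layers)

# emb: (seq_len, batch, emb_dim) -- the default layout when batch_first=False
emb = torch.randn(3, 1, emb_dim)
_, (hT, _) = rnn(emb)

# hT: (num_layers * num_directions, batch, hidden_dim)
print(hT.shape)        # torch.Size([1, 1, 16])
print(hT[-1].shape)    # torch.Size([1, 16]) -- last layer's final state
```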

I would like to use the LSTM to generate vectors for any pair of sequences s1 and s2 (say h1T and h2T, respectively) and compute their distance via F.cosine_similarity:

score = F.cosine_similarity(h1T, h2T, 1, 1e-6).item()
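As a side note, `eps` should be passed as a float rather than the string `'1e-6'`, and `.item()` is the current idiom for extracting a Python scalar (indexing `.data` is deprecated). A minimal runnable sketch with random placeholder vectors:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Placeholder final-state vectors; shape (batch, hidden_dim)
h1T = torch.randn(1, 16)
h2T = torch.randn(1, 16)

# dim=1 compares along the hidden dimension; eps is a float, not a string
score = F.cosine_similarity(h1T, h2T, dim=1, eps=1e-6).item()
print(score)  # a scalar in [-1, 1]
```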

The problem I am seeing is that the final word in the input sequence dominates the representation. For instance, if I have LSTM vectors for the following two sentences:

s1= hospital emergency room

s2= hospital 2017 budget

then the following test sentence does not score highly against either of them:

t1 = hospital emergency policy

whereas the test sentence

t2 = US congressional budget

scores highly against

s2 = hospital 2017 budget

Is there a good explanation for why this happens? I thought the final state would encode the earlier tokens as part of its representation.

Thank you.