I’m trying to implement a model for image-text joint embedding, trained by optimizing a pairwise ranking loss. However, I noticed a problem with the LSTM’s last hidden state (often used as the embedding of the input sequence) – its statistical properties look a bit weird.
The input batch has shape (160, 18) – 160 sequences, max sequence length 18 – as shown in the picture:
and the last hidden state from a fresh (untrained) LSTM looks like this:
There are obvious vertical stripes. The word embeddings in use are torchtext.vocab.GloVe.
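For reference, here is a minimal sketch of the setup that reproduces what I’m looking at. The hidden size (512) and embedding dim (300, the standard GloVe size) are assumptions, and I use random tensors in place of the actual GloVe-embedded captions; “vertical stripes” means certain hidden units have much larger per-unit mean/std across the whole batch than others:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed dimensions: batch 160, max length 18, GloVe-300, hidden 512
batch_size, max_len, embed_dim, hidden_dim = 160, 18, 300, 512

# Stand-in for GloVe-embedded sequences (real code uses torchtext.vocab.GloVe)
x = torch.randn(batch_size, max_len, embed_dim)

lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
output, (h_n, c_n) = lstm(x)

# h_n: (num_layers, batch, hidden); last layer's state is the sequence embedding
emb = h_n[-1]  # shape (160, 512)

# Per-hidden-unit statistics across the batch: stripes show up as columns
# whose mean/std is far larger than the rest
col_mean = emb.mean(dim=0)  # (512,)
col_std = emb.std(dim=0)    # (512,)
print(emb.shape, col_mean.abs().max().item(), col_std.max().item())
```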
This statistical problem leads to inferior performance and a bad joint embedding space. After some iterations, vertical stripes also appear in the CNN feature matrix.
Does anyone know how to solve this problem? Thanks in advance.