I’m trying to develop a model for image captioning. The architecture that I’m using is a CNN + LSTM. The feature vector generated by the CNN is used as the initial hidden and cell state of the LSTM.
The problem I have is that the LSTM generates the highest-frequency word in the vocabulary multiple times. (I’m using argmax on the decoder output.)
I also tried treating the output as a probability distribution and sampling from it, but the BLEU score in that case was too low.
Can anyone tell me what the problem might be? If you want, I can share the code too.
Thanks in advance!
There are a couple of usual suspects, but without seeing any code it would be pure guesswork.
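To make the two decoding strategies mentioned in the question concrete, here is a minimal sketch in plain Python. It is not the original poster's code: the function names, the `temperature` parameter, and the example logits are all assumptions made for illustration, and a real model would produce the logits from the LSTM at each step.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Argmax decoding: always return the index of the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, temperature=1.0, rng=random):
    """Sampling decoding: draw an index according to the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical logits over a 4-word vocabulary for a single decoding step.
step_logits = [2.0, 1.5, 0.3, -1.0]

greedy_token = greedy_pick(step_logits)

# Seeded generator so the sampled sequence is reproducible.
rng = random.Random(0)
sampled_tokens = [sample_pick(step_logits, rng=rng) for _ in range(1000)]
```

Argmax is deterministic, so if the logits barely change from one step to the next, the same token is chosen every time. Sampling spreads the choices across the vocabulary, which adds variety but can also pick low-probability words; a temperature below 1 is one common way to trade between the two.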