Hi, I’m currently working on my diploma thesis and decided to do image captioning. I’ve already implemented a CNN -> LSTM model (without attention) and it works. I also found that switching to a 2-layer LSTM improved performance.
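For context, my baseline looks roughly like this (a simplified sketch, not my exact code; the class name `CaptionLSTM`, the ResNet-50 backbone, and the dimensions are just illustrative):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, num_layers=2):
        super().__init__()
        cnn = models.resnet50(pretrained=True)
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # drop the classifier head
        for p in self.cnn.parameters():
            p.requires_grad = False                 # keep the pre-trained CNN frozen
        self.img_proj = nn.Linear(2048, embed_dim)  # map the CNN feature to the embedding size
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feat = self.cnn(images).flatten(1)          # (B, 2048)
        feat = self.img_proj(feat).unsqueeze(1)     # (B, 1, E) -- fed as the first "time step"
        x = torch.cat([feat, self.embed(captions)], dim=1)
        out, _ = self.lstm(x)                       # (B, T+1, H)
        return self.fc(out)                         # logits over the vocabulary
```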
Then I decided to replace the RNN with a Transformer, using it in almost the same way: in the RNN case I feed the feature vector from the pre-trained CNN into the first time step of the LSTM, with the caption as the target output; in the Transformer case I feed this vector into the Transformer’s encoder, the caption into the Transformer’s decoder, and use the caption shifted by one token as the expected output.
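Roughly, my Transformer setup looks like this (again a simplified sketch; the dimensions, the learned positional embedding, and the name `CaptionTransformer` are illustrative, not my exact code):

```python
import torch
import torch.nn as nn

class CaptionTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, max_len=100):
        super().__init__()
        self.img_proj = nn.Linear(2048, d_model)   # CNN feature -> d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)  # learned positional embedding
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead, batch_first=True)
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, img_feat, captions, tgt_key_padding_mask=None):
        src = self.img_proj(img_feat).unsqueeze(1)  # (B, 1, D): one "token" for the encoder
        positions = torch.arange(captions.size(1), device=captions.device)
        tgt = self.embed(captions) + self.pos(positions)
        # standard causal mask so position i cannot attend to positions > i
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            captions.size(1)).to(captions.device)
        out = self.transformer(src, tgt, tgt_mask=tgt_mask,
                               tgt_key_padding_mask=tgt_key_padding_mask)
        return self.fc(out)                         # (B, T, vocab_size)
```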
However, after this I found that it doesn’t work well. Could the reason be that I batch captions of the same length together, so my tgt_key_padding_mask always looks like [False, False, False, False, ...]? (I don’t use a src mask or a memory mask, since my encoder input is just the single vector from the CNN.)
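For example, this is what the padding mask comes out to when every caption in a batch has the same length (`PAD_IDX` and the toy token ids here are made up):

```python
import torch

PAD_IDX = 0
captions = torch.tensor([[5, 9, 3, 7],
                         [2, 4, 8, 6]])        # a batch of equal-length captions
tgt_key_padding_mask = captions == PAD_IDX     # (B, T), True marks padded positions
print(tgt_key_padding_mask)
# tensor([[False, False, False, False],
#         [False, False, False, False]])  -- all False, since nothing is padded
```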
What do you think about that, and can you suggest anything to improve the performance?