Replacing LSTM with Transformer for Image Captioning

Hi, I’m working on my diploma project and decided to do image captioning. I’ve already implemented CNN -> LSTM (without attention) and it works. I also found that performance increased when I used a 2-layer LSTM.

Then I decided to replace the RNN with a Transformer, using it in almost the same way. With the RNN, I feed the vector from the pre-trained CNN into the first time step of the LSTM and get the caption as output. With the Transformer, I feed this vector into the Transformer’s encoder, feed the caption into the Transformer’s decoder, and use the caption shifted by one position as the expected output.
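For readers trying to picture this setup, here is a minimal sketch of how it could be wired up with PyTorch's `nn.Transformer`. It is not the code from this post: the class name `CaptionTransformer`, all sizes, and the learned positional embedding are assumptions made for illustration. The post also does not say whether a causal `tgt_mask` was used; the sketch includes one because `nn.Transformer` does not apply it on its own.

```python
import torch
import torch.nn as nn


class CaptionTransformer(nn.Module):
    """Hypothetical sketch: one CNN feature vector in, caption logits out."""

    def __init__(self, feat_dim=2048, vocab_size=1000, d_model=256, max_len=50):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, d_model)  # CNN vector -> model width
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned positions (assumption)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            dim_feedforward=512, dropout=0.1, batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats, captions_in):
        # feats: (batch, feat_dim); captions_in: (batch, T) token ids
        src = self.feat_proj(feats).unsqueeze(1)  # encoder input of length 1
        T = captions_in.size(1)
        device = captions_in.device
        pos = torch.arange(T, device=device)
        tgt = self.tok_emb(captions_in) + self.pos_emb(pos)
        # Causal mask: True marks positions that may NOT be attended to,
        # so position t only sees positions <= t.
        causal = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=device), diagonal=1
        )
        hidden = self.transformer(src, tgt, tgt_mask=causal)
        return self.out(hidden)  # (batch, T, vocab_size)
```

With captions stored as token ids of shape `(batch, T + 1)`, the decoder input would be `captions[:, :-1]` and the training target `captions[:, 1:]`, which is the "shifted by one" arrangement described above.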

After this, I found that it doesn’t work well. Could the reason be that I batch captions of the same length, so my tgt_key_padding_mask is always [False, False, False, False, ...]? (I don’t use src or memory masks, since my input is a vector from the CNN.)
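To make the mask question concrete, here is a small check one could run. It is only a sketch on random tensors with made-up sizes, not the model from this post. It compares a decoder run with an all-False `tgt_key_padding_mask` against a run with no mask at all, and against a run with a causal `tgt_mask`, which is a separate argument.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny decoder with dropout disabled so that runs are comparable.
layer = nn.TransformerDecoderLayer(
    d_model=32, nhead=4, dim_feedforward=64, dropout=0.0, batch_first=True
)
decoder = nn.TransformerDecoder(layer, num_layers=1).eval()

tgt = torch.randn(2, 5, 32)     # (batch, caption length, d_model)
memory = torch.randn(2, 1, 32)  # a single image vector per sample

# All-False padding mask: no position is marked as padding.
no_pad = torch.zeros(2, 5, dtype=torch.bool)
# Causal mask: True above the diagonal blocks attention to future tokens.
causal = torch.triu(torch.ones(5, 5, dtype=torch.bool), diagonal=1)

with torch.no_grad():
    out_plain = decoder(tgt, memory)
    out_pad = decoder(tgt, memory, tgt_key_padding_mask=no_pad)
    out_causal = decoder(tgt, memory, tgt_mask=causal)

same_as_unmasked = torch.allclose(out_plain, out_pad, atol=1e-5)
causal_differs = not torch.allclose(out_plain, out_causal, atol=1e-5)
print(same_as_unmasked, causal_differs)
```

An all-False padding mask masks nothing, so for equal-length captions it should behave like passing no padding mask. It does not stop the decoder from looking at future tokens; that is the job of `tgt_mask`, so it may be worth checking that one is passed during training.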

What do you think, and can you suggest anything to improve the performance?

Hi, were you able to solve this? I am currently replacing my LSTM model with a Transformer and was trying to find existing literature or sources. It would be great if you could help.

Hi, sorry for the delayed answer. In case it helps, there is a new paper that may be useful for your work. It is even available in the Transformers framework, but that work is related to QA, not text generation. Quick googling also turns up a pre-trained model on torch hub :)
As for me, I kept the LSTM in my project, because my Transformer solution failed due to my incorrect use of the Transformer architecture.