How to use embeddings with images for an RNN encoder-decoder?

Hi all. I plan to use an RNN encoder-decoder to predict some actions in an image sequence. My idea is to first apply some convolutions and do feature extraction with fully connected layers to get a final vector of n elements. I then plan to feed this n-element vector to the RNN encoder cell. Now, I have heard that when dealing with words, the one-hot vector is converted to a lower-dimensional vector through an embedding layer.

So, my question is: given that I have already done feature extraction from an image and obtained a lower-dimensional vector (n is really small), does it make sense to add an embedding here? Can my convolutional and fully connected layers be treated as “embedding” the image?
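Concretely, something like this is what I have in mind (a minimal PyTorch sketch; `FrameEncoder`, the layer sizes, and `n=64` are just placeholders I made up for illustration):

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """CNN + fully connected layers mapping one image to an n-element vector."""
    def __init__(self, n=64):  # n is a placeholder size
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, n)

    def forward(self, x):            # x: (batch, 3, H, W)
        h = self.conv(x).flatten(1)  # (batch, 32)
        return self.fc(h)            # (batch, n)

frame_encoder = FrameEncoder(n=64)
# Feature vectors go straight into the RNN encoder; no nn.Embedding,
# since the input is already a continuous vector, not a discrete token index.
rnn_encoder = nn.GRU(input_size=64, hidden_size=128, batch_first=True)

frames = torch.randn(8, 10, 3, 64, 64)       # (batch, seq_len, C, H, W)
feats = frame_encoder(frames.flatten(0, 1))  # (batch*seq_len, n)
feats = feats.view(8, 10, -1)                # (batch, seq_len, n)
outputs, hidden = rnn_encoder(feats)         # hidden: (1, batch, 128)
```

My understanding is that the conv + FC stack would play the same role here that the embedding layer plays for words, i.e., mapping a raw high-dimensional input to a dense vector the RNN can consume, but I would like confirmation.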

Did you manage to get an answer to this question? I am also interested in a similar problem.