Creating an LSTM network to take in multiple inputs

I am trying to build an LSTM network that predicts the next frame in a sequence based on the current frame and the action taken (there is one action per frame). I currently encode each frame into a latent vector of size 128, and each action is represented by an array of size 10. How would I format the input to the LSTM network?

For example, if I have a video consisting of 3191 frames, I will have a tensor of shape (3191 x 128) for all encoded frames. Would appending the action array associated with each frame to its latent vector work? Or is there a way of inputting the encoded frames and actions separately into the LSTM?
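
To make the first option concrete, here is a rough sketch of what I mean by appending, written in PyTorch. The variable names (`frames`, `actions`) and the hidden size are just placeholders for my setup; the random tensors stand in for the real encoder outputs and action arrays:

```python
import torch
import torch.nn as nn

# Placeholder data matching my shapes: 3191 frames, latent size 128, action size 10
frames = torch.randn(3191, 128)   # encoded frames from the encoder
actions = torch.randn(3191, 10)   # one action vector per frame

# Concatenate along the feature dimension, giving one vector of size 138 per timestep
inputs = torch.cat([frames, actions], dim=-1)  # shape (3191, 138)
inputs = inputs.unsqueeze(0)                   # shape (1, 3191, 138) for batch_first

# LSTM whose input size is the combined latent + action size
lstm = nn.LSTM(input_size=128 + 10, hidden_size=256, batch_first=True)
output, (h_n, c_n) = lstm(inputs)              # output shape (1, 3191, 256)
```

Is this concatenation approach reasonable, or does mixing the two feature types in one vector cause problems?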