CNN + LSTM for sequence-to-sequence video captioning

I have a CNN layer followed by an LSTM layer for video captioning.

1.) The number of frames per input video is variable.
2.) A batch may contain videos of different lengths at training time.

My CNN takes input of shape [batch_size x time x 3 (color) x frame_height x frame_width].

How should the CNN stage work? The CNN operates on single frames, so would you parallelize across time, or across the batch dimension? How do people generally do this?
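For concreteness, here is a sketch of what I imagine the "fold time into the batch dimension" approach looks like (PyTorch assumed; the tiny CNN here is a toy one I made up, not my real model):

```python
import torch
import torch.nn as nn

# Fold batch and time into one dimension so the 2-D CNN treats
# every frame of every video as an independent image.
B, T, C, H, W = 2, 5, 3, 32, 32
frames = torch.randn(B, T, C, H, W)

# Toy per-frame CNN (hypothetical stand-in for the real backbone).
cnn = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),        # -> (B*T, 8, 1, 1)
)

x = frames.view(B * T, C, H, W)     # (10, 3, 32, 32)
feats = cnn(x).flatten(1)           # (10, 8) per-frame feature vectors
feats = feats.view(B, T, -1)        # back to (2, 5, 8) for the LSTM
```

This way the CNN is parallel over both batch and time at once, and the result is reshaped back into a sequence per video.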

I would also like to know the right order of the pad / pack / unpack steps when feeding variable-length sequences to an LSTM.
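Here is my current understanding of the pad → pack → unpack order, assuming PyTorch's `torch.nn.utils.rnn` helpers (the lengths and feature size are made-up toy values):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import (pad_sequence, pack_padded_sequence,
                                pad_packed_sequence)

# Toy per-frame feature sequences of different lengths (feature dim 8).
seqs = [torch.randn(t, 8) for t in (5, 3, 4)]
lengths = torch.tensor([5, 3, 4])

# 1) pad: stack into one tensor of shape (batch, max_T, feat)
padded = pad_sequence(seqs, batch_first=True)            # (3, 5, 8)

# 2) pack: record the true lengths so the LSTM skips padding steps
packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
packed_out, (h, c) = lstm(packed)

# 3) unpack: back to a padded tensor for downstream layers
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
# out is (3, 5, 16); out_lengths restores the original order [5, 3, 4]
```

Is this the intended pipeline, and does anything change when the padded positions come from videos of different frame counts?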

The CNN is learnable here, so you can't compute the features offline and then feed them to the LSTM.
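In other words, the whole pipeline has to train end-to-end so that gradients from the captioning loss reach the CNN. A minimal sketch of what I mean (PyTorch assumed; the modules and the vocabulary size of 100 are toy placeholders):

```python
import torch
import torch.nn as nn

# End-to-end: gradients must flow from the caption head back into the CNN.
B, T, C, H, W = 2, 4, 3, 16, 16
cnn = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 100)       # toy vocabulary of 100 words

video = torch.randn(B, T, C, H, W)
feats = cnn(video.view(B * T, C, H, W)).view(B, T, -1)   # (2, 4, 8)
out, _ = lstm(feats)
logits = head(out)                                       # (2, 4, 100)

# Dummy loss; a real model would use cross-entropy against the caption.
loss = logits.sum()
loss.backward()
# The conv weights receive gradients, i.e. the CNN trains jointly.
```

So precomputing frame features once is not an option; the CNN forward pass has to run inside every training step.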