Video frame classification with a few long videos

I am trying to train an RNN to make a binary classification on the frames of a video stream.
I already tested an RNN model with the Jester dataset; now I want to train my model on my own dataset.
My question is how to prepare my dataset, since it consists of 10 videos, each about 10 minutes long on average.

My first (and probably naive) idea was to batch the frames in the order they appear in the video, but I know it is important to shuffle your dataset when training, so I discarded this idea.

Any other suggestions?
Thank you.

For splitting the data into train, validation, and test sets, you have to make sure that the training set does not include, e.g., the few seconds just before and after a test sample; otherwise the model might not actually generalize to entirely new videos. In general, it can make sense to use the beginning of each video for training, the middle for validation, and the end for testing.
Within each of the three sets you then create individual training examples.
Those training examples you can likely shuffle.
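A minimal sketch of this idea, assuming frames are already in a plain Python list (the helper names `chronological_split` and `make_windows` are just for illustration):

```python
import random

def chronological_split(frames, train=0.7, val=0.15):
    # beginning -> train, middle -> validation, end -> test,
    # so no clip in one split overlaps in time with another split
    n = len(frames)
    i, j = int(n * train), int(n * (train + val))
    return frames[:i], frames[i:j], frames[j:]

def make_windows(frames, window, stride):
    # individual training examples: overlapping clips of `window` frames
    return [frames[k:k + window] for k in range(0, len(frames) - window + 1, stride)]

frames = list(range(600))            # stand-in for one 10-minute video at 1 frame/s
train_set, val_set, test_set = chronological_split(frames)
train_clips = make_windows(train_set, window=3, stride=1)
random.shuffle(train_clips)          # shuffling is safe *within* a split
```

The key point is that the split happens chronologically first; only the short clips inside each split get shuffled.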
You could e.g. read through this blog post:

Thanks for the response, @floriandonhauser !
About these individual training samples: I followed some ideas from the blog post you mentioned.
If I take a video with a duration of 600 seconds (10 minutes) and extract samples of 3 seconds with a stride of 1 second, the number of samples will be:

10*60 - (3 - 1) = 598

Since I have 10 videos, I will have 598*10 samples. If I decrease the stride to half a second, I roughly double this amount. That sounds like a good amount to me, especially since I may get some new videos in the future.
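As a quick sanity check of that count (the `num_windows` helper below is just my own illustration of the sliding-window formula):

```python
def num_windows(duration_s, window_s, stride_s):
    # number of complete windows of length window_s that fit in duration_s
    # when starting a new window every stride_s seconds
    return int((duration_s - window_s) // stride_s) + 1

print(num_windows(600, 3, 1))    # 598 samples per video
print(num_windows(600, 3, 0.5))  # 1195 samples, roughly double
```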

But I have one question about this. When this model is deployed, it will run on a video stream that can be hours long. If I train on short samples (3 seconds), how will the recurrent (LSTM) cells learn to keep long-term information in memory?

I will try to train the network with the ideas you presented; they have already helped me a lot!
Thank you very much.

First, I am no expert in using LSTM cells for video data.
As far as I understand, LSTMs are not perfect at keeping information over many invocations.
However, the question is whether that is even necessary.
If you want to classify the current situation in your video stream, is history from, e.g., more than 2 minutes ago really important? That depends entirely on the video data you are using.
Think about the kind of data, your goal, and how much history the model needs to keep.
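One thing that helps at deployment time, independent of how long the training clips were, is to carry the recurrent state across chunks of the stream instead of resetting it for every chunk. A toy sketch of that idea (the `rnn_step` recurrence here is a stand-in for an LSTM cell update, not a real one; with PyTorch's `nn.LSTM` you would pass the returned `(h, c)` tuple back in on the next call):

```python
def rnn_step(h, x):
    # hypothetical toy recurrence standing in for an LSTM cell update
    return 0.9 * h + 0.1 * x

def run_chunked(stream, chunk_size, h=0.0):
    # process the stream in short chunks, but CARRY the hidden state `h`
    # from one chunk to the next instead of resetting it
    for start in range(0, len(stream), chunk_size):
        for x in stream[start:start + chunk_size]:
            h = rnn_step(h, x)
        # here one would emit a per-chunk classification from h
    return h

stream = [float(i % 7) for i in range(1000)]
h_chunked = run_chunked(stream, chunk_size=3)

# carrying the state makes chunked processing equivalent to
# running over the whole stream in one pass
h_full = 0.0
for x in stream:
    h_full = rnn_step(h_full, x)
```

So the memory available at inference is not limited to the 3-second training window, although whether the trained weights actually *use* longer history is a separate question.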

I don’t know for sure, but I think that a few minutes of history is enough.
But anyway, can a network trained only on samples of a few seconds learn to keep a history of a couple of minutes?