Let’s focus on the training folder first and, if possible, apply the same approach to the test and eval folders.
So in your training folder you have 6 different action folders, which represent different classes.
For example, boxing would be class0, and jogging class1.
In each of these “action (class) folders” you have 25 subfolders, each holding the frames of a different person performing that action.
As far as I understand, you don’t want to mix up the frames of different persons, i.e. you would like each sequence to contain a single action performed by a single person. The next sequence might then contain another action from another person.
Is this correct?
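If that is the case, a first step could be to index the folders so that the frames of each (action, person) pair stay together. Here is a minimal sketch, assuming a layout of `root/<action>/<person>/<frame files>`; the function name `index_frames` and the returned tuple format are just my own placeholders, not part of any library:

```python
from pathlib import Path


def index_frames(root):
    """Group frame paths by (action, person).

    Assumes the layout root/<action>/<person>/<frame files>.
    Returns (class_to_idx, groups), where each entry of groups is
    (class_index, person_folder_name, sorted_frame_paths).
    """
    root = Path(root)
    # Class indices follow the alphabetical order of the action folder names.
    action_dirs = sorted(p for p in root.iterdir() if p.is_dir())
    class_to_idx = {p.name: i for i, p in enumerate(action_dirs)}

    groups = []
    for action_dir in action_dirs:
        person_dirs = sorted(p for p in action_dir.iterdir() if p.is_dir())
        for person_dir in person_dirs:
            # Frames of one person are kept together and never mixed with others.
            frames = sorted(f for f in person_dir.iterdir() if f.is_file())
            groups.append((class_to_idx[action_dir.name], person_dir.name, frames))
    return class_to_idx, groups
```

Note that `sorted` orders the file names lexicographically, so frame names would need zero-padded numbers (e.g. `image002` rather than `image2`) to come out in temporal order. The class indices also depend on the alphabetical order of your folder names, so the actual index of each action may differ from the example above.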
EDIT: Do you want each sequence to have the same length, e.g. 10 images?
If so, do you want a sliding window approach, i.e.:
batch0: box_person0_image0, box_person0_image1, box_person0_image2, ... box_person0_image9
batch1: box_person0_image1, box_person0_image2, box_person0_image3, ... box_person0_image10
or rather:
batch0: box_person0_image0, box_person0_image1, box_person0_image2, ... box_person0_image9
batch1: box_person0_image10, box_person0_image11, box_person0_image12, ... box_person0_image19
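Both options can be expressed with the same windowing logic and a different stride. This is only a sketch in plain Python, assuming the frames of one person are already in temporal order; `make_windows`, `seq_len` and `stride` are names I made up for illustration:

```python
def make_windows(frames, seq_len=10, stride=1):
    """Split one person's ordered frames into fixed-length sequences.

    stride=1 gives the sliding window variant,
    stride=seq_len gives non-overlapping chunks.
    Trailing frames that do not fill a whole window are dropped.
    """
    if seq_len <= 0 or stride <= 0:
        raise ValueError("seq_len and stride must be positive")
    last_start = len(frames) - seq_len
    return [frames[s:s + seq_len] for s in range(0, last_start + 1, stride)]


frames = [f"box_person0_image{i}" for i in range(20)]

# Sliding window: consecutive sequences overlap by seq_len - 1 frames.
sliding = make_windows(frames, seq_len=10, stride=1)

# Non-overlapping: each frame appears in at most one sequence.
chunks = make_windows(frames, seq_len=10, stride=10)
```

Since the windows are built per person, a sequence never crosses over into the frames of another person or action. A person with fewer than `seq_len` frames yields no sequence at all in this sketch, so you would have to decide whether to pad or skip those.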