Sampling a batch of specific sets of frames

I have grayscale videos from 3 different classes.

I’d like to build a 3D conv network that would analyse a collection of frames (not necessarily consecutive) from each video and classify the video based on them.

Each class has its own folder. Within the folder, frames are named as follows: VidName_FrameNum (E4FG89_1 for the first frame, E4FG89_2 for the second, and so on).

I’d like to get a sample of specific frames from a batch of videos.
I don’t know which frame selection will work best, so I need to be able to customise it: for example, frames (1, 5, 10, 15), or (3, 10, 17), or (1, 3, 5, 7, 9, 11, 13), etc.

I think the output tensor dimensions should look like (n_sets, n_frames, n_rows, n_cols)

Is there something built into PyTorch to achieve this? How should I go about it?

Would love some assistance here.

Should I create a custom sampler and use it in a DataLoader object?

Would you like to manually define these indices (e.g. [1, 5, 10, 15])?
If so, you could add your logic in the __getitem__ method of your Dataset or alternatively use a custom collate_fn.
Could you explain your workflow a bit more? E.g. would each “set” have the same number of frames?
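
A minimal sketch of that idea, assuming the frames are saved as `.png` images with the `VidName_FrameNum` naming from above (the thread doesn’t specify the file format, and `FrameSetDataset`, `samples`, and `root` are made-up names):

```python
import os

import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision.transforms.functional import to_tensor


class FrameSetDataset(Dataset):
    """One sample = one (class folder, video name, frame indices, label) entry."""

    def __init__(self, samples, root):
        self.samples = samples
        self.root = root

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        class_folder, vid_name, frame_indices, label = self.samples[idx]
        frames = []
        for i in frame_indices:
            # assumes frames are stored as VidName_FrameNum.png
            path = os.path.join(self.root, class_folder, f"{vid_name}_{i}.png")
            img = Image.open(path).convert("L")   # grayscale
            frames.append(to_tensor(img)[0])      # (n_rows, n_cols)
        return torch.stack(frames), label         # (n_frames, n_rows, n_cols)


# e.g. frames (1, 5, 10, 15) of video E4FG89 from class_0
samples = [("class_0", "E4FG89", (1, 5, 10, 15), 0)]
loader = DataLoader(FrameSetDataset(samples, root="data"), batch_size=4, shuffle=True)
# each batch has shape (batch_size, n_frames, n_rows, n_cols)
```

Note that the default collate only works if every sample in a batch has the same number of frames; if the sets can differ in length, that’s where a custom collate_fn (or padding) would come in.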

Sure, I’ll explain in more detail:

I have a couple hundred videos, each with a different number of frames, all divided into 3 classes.
Because of the small amount of data, I’d like to augment it somewhat by dividing each video into independent sets of frames.

Each class folder contains the relevant videos broken down into frames in the form of:
[[vid0_0, vid0_1, vid0_2, …], [vid1_0, vid1_1, vid1_2, …], …]

Instead of loading all frames for a given video, I’d like to load frames [0, 5, 10, 15, …] and then, separately, frames [1, 6, 11, 16, …] and so on, treating one video as if it were 5 separate ones.

Skipping 5 frames in between is just an example; ideally, I’d like that number to be configurable.
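
Roughly what I have in mind for generating the index sets (just a sketch; `split_video_indices` is a hypothetical helper, and `max_len` is only there in case the 3D conv needs a fixed number of frames per sample):

```python
def split_video_indices(n_frames, stride=5, max_len=None):
    """Split one video of n_frames frames into `stride` independent
    frame-index sets: [0, 5, 10, ...], [1, 6, 11, ...], and so on."""
    sets = []
    for offset in range(stride):
        indices = list(range(offset, n_frames, stride))
        if max_len is not None:
            indices = indices[:max_len]  # optionally truncate to a fixed length
        sets.append(indices)
    return sets


# a video with 23 frames and stride 5 becomes 5 independent samples
for s in split_video_indices(23, stride=5):
    print(s)
# [0, 5, 10, 15, 20]
# [1, 6, 11, 16, 21]
# [2, 7, 12, 17, 22]
# [3, 8, 13, 18]
# [4, 9, 14, 19]
```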

To simplify things for starters, let’s ignore the prospect of sets of frames from the same video making it into both the train AND test sets (which is of course a form of leakage).

Perhaps if I pre-process the data into the following structure, it would make life a bit easier?
The downside of this approach is the limited flexibility to shuffle and do things like cross-validation.

- train_data
    - class_0
         - vid1_0
              - frame0
              - frame5
              - frame10
              ...
         - vid1_1
              - frame1
              - frame6
              - frame11
              ...
         - vid1_ ...
              ...
         - vid2_0
              - same structure as vid1
         ...
    - class_1
         - same structure as class_0

- val_data
    - same structure as train_data
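
If it helps, a rough sketch of the pre-processing step that would produce that layout (`presplit` is a hypothetical helper; it assumes the original VidName_FrameNum file naming and routes each frame into a VidName_offset sub-folder based on FrameNum modulo the stride):

```python
import os
import shutil


def presplit(src_class_dir, dst_class_dir, stride=5):
    """Copy VidName_FrameNum files into per-set sub-folders
    (VidName_0, VidName_1, ...) following the layout above."""
    for fname in os.listdir(src_class_dir):
        stem, _ = os.path.splitext(fname)
        vid_name, frame_num = stem.rsplit("_", 1)
        offset = int(frame_num) % stride  # which index set this frame belongs to
        set_dir = os.path.join(dst_class_dir, f"{vid_name}_{offset}")
        os.makedirs(set_dir, exist_ok=True)
        shutil.copy(os.path.join(src_class_dir, fname), set_dir)


presplit("raw_frames/class_0", "train_data/class_0", stride=5)
```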

For anyone who may run into this in the future: I solved the “problem” by pre-splitting the videos and re-saving each set as a pickle file of a tensor containing the relevant frames. Loading is then handled by a custom Dataset that unpickles these files on access. This adds some overhead, but it’s easily mitigated by using several DataLoader workers.
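
In case it’s useful, the loading side looks roughly like this (names are made up, and I’m using torch.save / torch.load here, which pickle the tensors under the hood):

```python
import os

import torch
from torch.utils.data import DataLoader, Dataset


class PickledFrameSets(Dataset):
    """Loads pre-split frame-set tensors saved as one file per set
    inside per-class sub-folders (class_0, class_1, class_2)."""

    def __init__(self, root):
        self.samples = []
        for label, class_name in enumerate(sorted(os.listdir(root))):
            class_dir = os.path.join(root, class_name)
            for fname in sorted(os.listdir(class_dir)):
                self.samples.append((os.path.join(class_dir, fname), label))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        return torch.load(path), label  # (n_frames, n_rows, n_cols)


# several workers hide the unpickling overhead
loader = DataLoader(PickledFrameSets("train_data"), batch_size=8,
                    shuffle=True, num_workers=4)
```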