Classification on a difficult video dataset with PyTorch

Hey all, I have a quick question about classifying video data in PyTorch. I have a video dataset with clips ranging from 1 to 10 minutes, each with a single binary label attached, so there's no way to indicate which frames show the action occurring and which don't.

I first wanted to ask, what's the correct way to load this video data? Currently I'm using a custom dataset class that batches every 5 frames of a video, then takes the next 5 frames of the same video until that video is done; the indices determining which video to grab are shuffled at the start of each epoch (see the sketch below). I was looking at the RaivoKoot/Video-Dataset-Loading-Pytorch repo on GitHub, but it isn't much help here, since it presumes every frame of the video displays an action and its batches can mix frames from different videos.
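Here's a minimal sketch of what I mean; the class name is made up and it assumes each video has already been decoded into a `(T, C, H, W)` tensor (in practice you'd probably decode frames lazily):

```python
import random
from torch.utils.data import IterableDataset

class SequentialClipDataset(IterableDataset):
    """Shuffle video order each epoch, then yield consecutive
    chunk_size-frame chunks from each video, in order, with the
    video-level binary label attached to every chunk."""
    def __init__(self, videos, labels, chunk_size=5):
        self.videos = videos        # list of (T, C, H, W) tensors
        self.labels = labels        # one 0/1 label per video
        self.chunk_size = chunk_size

    def __iter__(self):
        order = list(range(len(self.videos)))
        random.shuffle(order)       # new video order every epoch
        for v in order:
            vid = self.videos[v]
            # step through the video in non-overlapping chunks
            for start in range(0, vid.shape[0] - self.chunk_size + 1,
                               self.chunk_size):
                yield vid[start:start + self.chunk_size], self.labels[v]
```

Note this assumes `num_workers=0`; with multiple DataLoader workers an IterableDataset needs explicit sharding or every worker yields the full dataset.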

Second, how should I go about classifying these videos? I can assume that roughly every 30 seconds to a minute will contain multiple examples of the action occurring. I was planning on using convolutional layers followed by LSTM layers, though I worry an LSTM won't be good at retaining events that happened far back in such long clips. Something like the sketch below is what I had in mind.
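Roughly this kind of model; the layer sizes here are placeholders rather than tuned values, and I'd train it on the clip-level label with `BCEWithLogitsLoss`:

```python
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Per-frame CNN features fed into an LSTM; the final hidden
    state is classified into a single binary logit."""
    def __init__(self, feat_dim=128, hidden_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)    # binary logit

    def forward(self, clips):                   # clips: (B, T, C, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1))   # (B*T, feat_dim)
        feats = feats.view(b, t, -1)            # (B, T, feat_dim)
        _, (h, _) = self.lstm(feats)
        return self.head(h[-1]).squeeze(-1)     # (B,) logits
```

My concern is that the LSTM here only ever sees one 5-frame chunk at a time, so anything outside that window is lost unless I carry the hidden state across chunks of the same video.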

Any help at all would be appreciated!