Video Classification with CNN+LSTM

Hi, I have recently started working on video classification with CNN+LSTM and would like some advice. I have two folders, each of which should be treated as a class, with many video files in each. I want to build a well-organised data loader, similar to torchvision's ImageFolder, that reads the videos from the folders and associates each one with its label. I tried writing a function that stores the frames of every video in a list, but it takes a very long time. Could you also suggest any ready-made functions that would save time along the pipeline? And apart from splitting the videos into frames, is there a more sophisticated way of doing this? Thanks!

You can use a custom dataset that splits the videos into frames and associates them with labels in its __getitem__(self, idx) method. See: Writing Custom Datasets, DataLoaders and Transforms — PyTorch Tutorials 1.7.1 documentation

You can also use ready-made code such as: GitHub - YuxinZhaozyx/pytorch-VideoDataset: Tools for loading video dataset and transforms on video in pytorch. You can directly load video files without preprocessing.
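If you want the ImageFolder-style behaviour without extra dependencies, here is a minimal sketch of the idea. It only indexes file paths up front and decodes a video when it is requested, which should avoid the long wait from loading every frame into a list. The names index_video_folder, VideoFolder and load_video are made up for illustration, and the extension list is an assumption; load_video is whatever decoder you plug in (for example a wrapper around torchvision.io.read_video or OpenCV).

```python
import os

# Assumed set of extensions; adjust to whatever your folders contain.
VIDEO_EXTENSIONS = (".mp4", ".avi", ".mov", ".mkv")


def index_video_folder(root):
    """Return (samples, classes), where samples is a list of (path, label)."""
    # Each sub-folder of root is one class; sorting makes labels reproducible.
    classes = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    samples = []
    for label, name in enumerate(classes):
        class_dir = os.path.join(root, name)
        for fname in sorted(os.listdir(class_dir)):
            if fname.lower().endswith(VIDEO_EXTENSIONS):
                samples.append((os.path.join(class_dir, fname), label))
    return samples, classes


class VideoFolder:
    """Map-style dataset: keeps paths only and decodes a video on access."""

    def __init__(self, root, load_video, transform=None):
        self.samples, self.classes = index_video_folder(root)
        self.load_video = load_video  # hypothetical callable: path -> frames
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        video = self.load_video(path)  # decoding happens here, not in __init__
        if self.transform is not None:
            video = self.transform(video)
        return video, label
```

In a real project you would probably subclass torch.utils.data.Dataset; with __len__ and __getitem__ in place the object can be passed to a DataLoader, whose worker processes then decode videos in parallel.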

Yeah, this was my first step. I have made my custom data loader, and now I think I need to build the CNN encoder and LSTM decoder for the pipeline. Here is the code so far:

import datasets
import transforms
import video_csv_creation
import torch
import torchvision

video_folder_path = ''
csv_save_path = ''
data_loader_save = ''

# CUDA for PyTorch
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
torch.backends.cudnn.benchmark = True

# Parameters
params = {'batch_size': 2,
          'shuffle': True,
          'num_workers': 4}

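# Note: these lines were garbled in the post. The Compose wrapper, the
# DataLoader call and torch.save are assumed from the surrounding code.
dataset = datasets.VideoLabelDataset(
    csv_save_path,
    transform=torchvision.transforms.Compose([
        transforms.VideoFilePathToTensor(100, fps=15, padding_mode='last'),
        transforms.VideoResize([256, 256]),
    ]),
)
data_loader = torch.utils.data.DataLoader(dataset, **params)
torch.save(data_loader, data_loader_save)
print("Dataloader saved")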
for videos, labels in data_loader:
    print(videos.size(), labels)

The data loader now yields a tensor of size batch_size x channels x frames x height x width along with the labels, for example:

torch.Size([2, 3, 50, 256, 256]) tensor([1, 1])
torch.Size([2, 3, 50, 256, 256]) tensor([1, 0])

Now, how can I build a CNN that takes a batch of 2 videos, each consisting of 50 frames with 3 channels, plus one label per video? So far I have only worked with CNNs that take batches of individual frames and their labels. Thanks.

I think you need to use a tensor of size batch_size x frames x channels x height x width. Apply the CNN only to each channels x height x width frame; the CNN should then return a tensor of size batch_size x frames x features_from_CNN, which you can feed to an LSTM network for the final classification.
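Getting from the loader's layout (batch_size x channels x frames x height x width) to this one should only take a permute. A minimal sketch, with a random tensor standing in for a real batch; the small sizes are my assumption to keep it light, while the loader in the post yields [2, 3, 50, 256, 256]:

```python
import torch

# Stand-in for one batch from the loader:
# batch x channels x frames x height x width
videos = torch.rand(2, 3, 5, 64, 64)

# Move the frame axis next to the batch axis:
# batch x frames x channels x height x width
clips = videos.permute(0, 2, 1, 3, 4).contiguous()

b, t, c, h, w = clips.shape
# Fold frames into the batch axis so every frame is one CNN input.
frames = clips.view(b * t, c, h, w)
# The same view in reverse restores the per-video grouping after the CNN.
restored = frames.view(b, t, c, h, w)

print(clips.shape)   # torch.Size([2, 5, 3, 64, 64])
print(frames.shape)  # torch.Size([10, 3, 64, 64])
```

The .contiguous() call matters here: permute only changes strides, and view needs contiguous memory (reshape would also work).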

Something similar to:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 10, 5)
        self.conv2 = nn.Conv2d(10, 20, 5)
        self.conv3 = nn.Conv2d(20, 30, 5)

    def forward(self, i):
        # i: batch x frames x channels x height x width
        # Fold frames into the batch axis so each frame is one CNN input.
        x = i.view(-1, i.shape[2], i.shape[3], i.shape[4])
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.avg_pool2d(x, 4)
        # Unfold again: batch x frames x features
        x = x.view(i.shape[0], i.shape[1], -1)
        return x

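class LSTM(nn.Module):
    def __init__(self):
        super().__init__()
        # 750 = 30 channels * 5 * 5, which holds for 32x32 input frames.
        # batch_first=True so the LSTM reads batch x frames x features;
        # without it the recurrence would run over the batch axis.
        self.lstm = nn.LSTM(750, 100, batch_first=True)
        # 100 hidden units * 50 frames, flattened for the classifier.
        self.fc = nn.Linear(100*50, 2)

    def forward(self, x):
        x, _ = self.lstm(x)
        x = x.reshape(x.shape[0], -1)
        x = self.fc(x)
        return x
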
x = torch.rand((64, 50, 3, 32, 32))
net_cnn = CNN()
net_lstm = LSTM()

features = net_cnn(x)
out = net_lstm(features)
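
One thing to watch: the 750 and 100*50 above are tied to 32x32 frames and exactly 50 frames per clip, so they will not match 256x256 input as-is. A possible variation (not the code above, just a sketch under my own assumptions) infers the feature size with a dummy forward pass and classifies from the last LSTM step, so it should also cope with other resolutions and clip lengths. FrameEncoder and ClipClassifier are names I made up, and frame_size is a hypothetical parameter.

```python
import torch
import torch.nn as nn


class FrameEncoder(nn.Module):
    """Same three conv layers as above, applied to every frame."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 10, 5), nn.ReLU(),
            nn.Conv2d(10, 20, 5), nn.ReLU(),
            nn.Conv2d(20, 30, 5), nn.ReLU(),
            nn.AvgPool2d(4),
        )

    def forward(self, x):
        # x: batch x frames x channels x height x width
        b, t = x.shape[:2]
        x = self.features(x.reshape(b * t, *x.shape[2:]))
        return x.reshape(b, t, -1)  # batch x frames x features


class ClipClassifier(nn.Module):
    def __init__(self, frame_size, num_classes=2, hidden=100):
        super().__init__()
        self.encoder = FrameEncoder()
        # Run one dummy frame through the encoder to find the feature size
        # instead of hard-coding it.
        with torch.no_grad():
            dummy = torch.zeros(1, 1, 3, *frame_size)
            feat_dim = self.encoder(dummy).shape[-1]
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):
        out, _ = self.lstm(self.encoder(x))
        # Use only the last time step, so the frame count is not fixed.
        return self.fc(out[:, -1])


model = ClipClassifier(frame_size=(64, 64))
logits = model(torch.rand(2, 7, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 2])
```

The logits can go straight into nn.CrossEntropyLoss together with the label tensor from the loader. Whether the last step or the flattened sequence works better is something you would have to try on your data.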