Good morning, everyone.
I am trying to do video classification on a dataset in which I have split each video into K frames. After this split, I stack all the videos into a tensor and use a data loader to pass batches of shape (batch x frames x channels x height x width) to the network.
The frames are 112x112x3.
I just can’t figure out how to structure the network so that N videos of K frames each go in and N labels come out.
The following code does not work, since its output is a tensor of shape 16x16x1x1x1, whereas I want a 16x1 tensor:
model = torch.nn.Sequential(
    nn.Conv3d(min_video_frames, BATCH_SIZE, kernel_size=(3, 112, 112), padding=0),
    nn.BatchNorm3d(BATCH_SIZE),
    nn.Conv3d(BATCH_SIZE, 256, kernel_size=(3, 112, 112), padding=0),
    nn.BatchNorm3d(256),
    nn.Linear(256, 1)
)
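
For comparison, here is a minimal sketch of one way a network with the desired input and output shapes could be laid out. It is not a tuned architecture. It rests on a few assumptions: `nn.Conv3d` expects its input as (batch, channels, frames, height, width), so the frames and channels axes are swapped first. The first convolution takes the 3 colour channels as `in_channels`, rather than the frame count or the batch size. An `nn.AdaptiveAvgPool3d(1)` followed by `torch.flatten` collapses the remaining dimensions before the linear layer. The class name `VideoClassifier` and the layer widths (16, 32) are hypothetical choices for illustration.

```python
import torch
from torch import nn


class VideoClassifier(nn.Module):
    """Hypothetical minimal 3D CNN: (batch, frames, channels, H, W) -> (batch, num_outputs)."""

    def __init__(self, in_channels: int = 3, num_outputs: int = 1):
        super().__init__()
        self.features = nn.Sequential(
            # in_channels is the number of colour channels, not the frame count.
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.BatchNorm3d(16),
            nn.ReLU(inplace=True),
            # Halve height and width only; keep the number of frames.
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm3d(32),
            nn.ReLU(inplace=True),
            # Average over frames, height and width -> (batch, 32, 1, 1, 1).
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(32, num_outputs)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The data loader yields (batch, frames, channels, H, W);
        # Conv3d wants (batch, channels, frames, H, W).
        x = x.permute(0, 2, 1, 3, 4)
        x = self.features(x)
        x = torch.flatten(x, 1)  # (batch, 32)
        return self.classifier(x)  # (batch, num_outputs)


model = VideoClassifier()
clips = torch.randn(4, 8, 3, 112, 112)  # 4 videos, 8 frames each (made-up sizes)
print(model(clips).shape)  # torch.Size([4, 1])
```

Because the pooling layer is adaptive, the same module accepts any number of frames K and any batch size N, and always returns one row per video.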