How to apply Conv3d to a batch of videos split into frames?

Good morning, everyone.
I am trying to do video classification on a dataset of videos that I have split into N frames per video. After this subdivision, I put all the videos in a tensor and through a data loader I pass a tensor (batch x frames x channels x height x width) to the network.
The frames are 112x112x3.
I just can’t figure out how I have to build the structure of the network so that N videos with K frames come in and N labels go out.

The following code does not work: I get an output tensor of shape 16x16x1x1x1, but I want a 16x1 tensor:

model = torch.nn.Sequential(
    nn.Conv3d(min_video_frames, BATCH_SIZE, kernel_size=(3, 112, 112), padding=0),
    nn.Conv3d(BATCH_SIZE, 256, kernel_size=(3, 112, 112), padding=0),
    nn.Linear(256, 1),
)

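You would need to flatten the activation before feeding it to the linear layer, e.g. via nn.Flatten. The convolution output still carries trailing singleton dimensions (16x16x1x1x1 in your case), while nn.Linear operates only on the last dimension.
After flattening, the model would return an output tensor of shape [batch_size, 1].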
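As a rough illustration of the nn.Flatten suggestion, here is a minimal sketch. It assumes 112x112 RGB frames, a hypothetical min_video_frames of 8, and BATCH_SIZE of 16, none of which are confirmed by the original post. It also swaps the second convolution's kernel for a 1x1x1 kernel, because the first convolution already reduces the activation to 1x1x1 and a (3, 112, 112) kernel would no longer fit; that change is my assumption, not part of the original model.

```python
import torch
import torch.nn as nn

# Assumed values for illustration only; the original post does not state them.
BATCH_SIZE = 16
min_video_frames = 8

model = nn.Sequential(
    # As in the question, frames are treated as input channels. The
    # (3, 112, 112) kernel collapses (channels, height, width) to (1, 1, 1).
    nn.Conv3d(min_video_frames, 16, kernel_size=(3, 112, 112), padding=0),
    nn.ReLU(),
    # Assumption: a 1x1x1 kernel, since the activation is already 1x1x1 here.
    nn.Conv3d(16, 256, kernel_size=1),
    nn.ReLU(),
    # [batch, 256, 1, 1, 1] -> [batch, 256]
    nn.Flatten(),
    nn.Linear(256, 1),
)

# Dummy batch shaped (batch, frames, channels, height, width).
x = torch.randn(BATCH_SIZE, min_video_frames, 3, 112, 112)
out = model(x)
print(out.shape)  # torch.Size([16, 1])
```

Note that nn.Conv3d documents its input as (N, C_in, D, H, W), so feeding frames as channels follows the question's layout rather than the usual convention. Permuting the batch to (batch, channels, frames, height, width) and convolving over the frame axis would be an alternative design.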