RNN for Video Classification

I am trying to train a SqueezeNet with ConvLSTM cells to perform video classification on the 20BN-Jester dataset.
I have never worked with a recurrent model before, so I am a little bit confused. I am starting my experiments with only the first 100 videos from the dataset, just to see if the model is able to overfit those 100 videos, but I cannot even get that to work.
My batches have shape (bs, seq_len, 3, H, W). I reshape them to (bs*seq_len, 3, H, W) and feed them through the SqueezeNet encoder, then reshape the features to (bs, seq_len, 512, H, W) so I can feed them to the ConvLSTM cell, and finally reshape back to (bs*seq_len, 512, H, W) to feed them through the SqueezeNet classifier. A small sketch of this shape handling is below.
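To make sure I am not messing up the reshapes, here is a minimal, self-contained sketch of just the shape handling. The encoder, temporal module, and classifier here are hypothetical placeholders (not my real SqueezeNet/ConvLSTM modules); only the view/reshape calls are the point:

import torch
import torch.nn as nn

encoder = nn.Conv2d(3, 512, kernel_size=3, stride=16, padding=1)  # placeholder for sq.features
temporal = nn.Identity()                                          # placeholder for the ConvLSTM
classifier = nn.Sequential(nn.Conv2d(512, 27, kernel_size=1), nn.AdaptiveAvgPool2d(1))

bs, seq_len, H, W = 2, 40, 112, 112
frames = torch.randn(bs, seq_len, 3, H, W)

flat = frames.view(bs * seq_len, 3, H, W)            # merge batch and time for the 2D CNN
feats = encoder(flat)                                 # -> (bs*seq_len, 512, h, w)
_, _, h, w = feats.shape
seq = feats.view(bs, seq_len, 512, h, w)              # split batch and time again for the ConvLSTM
out = temporal(seq)                                   # per-time-step features, same shape here
flat_out = out.reshape(bs * seq_len, 512, h, w)       # merge again for the classifier
logits = classifier(flat_out).view(bs, seq_len, 27)   # one prediction per frame
print(logits.shape)                                   # torch.Size([2, 40, 27])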

The implementation of convLSTM cell I am using is this one: https://github.com/ndrplz/ConvLSTM_pytorch/blob/master/convlstm.py
Here is the code of my model:

import torch
import torch.nn as nn
import torchvision

from convlstm import ConvLSTM  # the ConvLSTM class from the repository linked above

class SqueezeNet(nn.Module):
    def __init__(self, n_classes=27):
        super(SqueezeNet, self).__init__()
        
        self.n_classes = n_classes

        sq = torchvision.models.squeezenet.squeezenet1_0(pretrained=False)
        self.sq_features = sq.features
        self.convLSTM = ConvLSTM(512, [512], [(3,3)], 1, True, True, False) # one ConvLSTM cell, with input size of 512, output size of 512, and 3x3 kernels.
        self.sq_classifier = sq.classifier
        self.sq_classifier[1] = nn.Conv2d(512, n_classes, kernel_size=(1, 1), stride=(1, 1)) 

        self.init_layers()

    def forward(self, images):
        """
        Forward propagation.
        :param images: a tensor of dimensions (batch_size, seq_len, 3, H, W)
        :return: class predictions, a tensor of dimensions (batch_size, seq_len, n_classes)
        """
        b, s, c, h, w = images.shape
        images = images.view(b*s, c, h, w)
        
        out = self.sq_features(images) # (b*s, 512, H, W)
        _,_,h,w = out.shape
        out = self.convLSTM(out.view(b, s, 512, h, w))[0]
        out = out.view(b*s, 512, h, w) # (b*s, 512, H, W)
        out = self.sq_classifier(out) # (b*s, n_classes, 1, 1)
        return out.view(b, s, self.n_classes)

    def init_layers(self):
        for name, param in self.named_parameters():
            if name.endswith(".weight"):
                nn.init.xavier_uniform_(param)
            if name.endswith(".bias"):
                nn.init.constant_(param, 0.0)
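For completeness, this is roughly how I am driving the model. The optimizer and loss choices here are just an example of what I am doing; since I make one prediction per frame, I repeat the video label across all frames:

model = SqueezeNet(n_classes=27)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

videos = torch.randn(2, 40, 3, 224, 224)               # (bs, seq_len, 3, H, W)
labels = torch.randint(0, 27, (2,))                    # one class label per video

logits = model(videos)                                 # (bs, seq_len, n_classes)
frame_labels = labels.repeat_interleave(40)            # same label for every frame of a video
loss = criterion(logits.reshape(-1, 27), frame_labels)
loss.backward()
optimizer.step()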

Since the videos from the Jester dataset don't all have the same number of frames (35 to 40), I am appending copies of the last frame to the end of each video so that every video in the dataset has the same length; a small sketch of this padding is shown below. I am making one prediction for each frame of each video, but I have also tried making one prediction per video, and the loss just doesn't decrease either way. This leads me to think that I am doing something really wrong; otherwise the network should be able to overfit such a small dataset (100 videos).
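The padding I describe is essentially this (pad_video is just an illustrative helper, not the exact code in my dataset class):

import torch

def pad_video(frames, target_len):
    # frames: tensor of shape (seq_len, 3, H, W); repeat the last frame until target_len
    seq_len = frames.shape[0]
    if seq_len >= target_len:
        return frames[:target_len]
    last = frames[-1:].expand(target_len - seq_len, *frames.shape[1:])
    return torch.cat([frames, last], dim=0)

clip = torch.randn(37, 3, 112, 112)   # e.g. a 37-frame video
padded = pad_video(clip, 40)          # -> (40, 3, 112, 112)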
Can someone see what I am doing wrong? Or does anyone have a hint?