Problem with Video Classification Model

I have built a basic video classification neural network. The idea is to use a CNN to generate a feature vector for each frame, stack the per-frame vectors for each video, and then feed the resulting sequence into an LSTM. Here is the model:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 33 * 33, 1200)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 33 * 33)  # flatten; 16 * 33 * 33 = 17424
        #x = F.relu(self.fc1(x))
        x = self.fc1(x)
        return x

class VideoNet(nn.Module):
    def __init__(self):
        super(VideoNet, self).__init__()
        self.convnet = ConvNet()
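    def forward(self, x):
        # Assumed input shape: (batch, frames, channels, height, width)
        no_frames = x.shape[1]
        features = []
        for i in range(no_frames):
            # Assumed loop body: run the CNN on frame i of every video
            features.append(self.convnet(x[:, i]))
        return torch.stack(features, dim=1)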

class VideoLSTM(torch.nn.Module):
    def __init__(self, feature_size, hidden_dim):
        super(VideoLSTM, self).__init__()
        self.vidnet = VideoNet()
        self.lstm = nn.LSTM(feature_size, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, 9)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        x = self.vidnet(x)
        x = self.dropout(x)
        lstm_out, (ht, ct) = self.lstm(x)
        return self.linear(ht[-1])
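
The hard-coded 16 * 33 * 33 in fc1 only matches one range of frame sizes, so it may be worth checking by hand. Below is a rough sketch of that arithmetic in plain Python. The helper names `conv_out` and `convnet_flat_features` are made up for illustration, and the 144x144 frame size is an assumption (it is one input size consistent with 16 * 33 * 33; the actual frame size is not stated).

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of one spatial dimension after a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def convnet_flat_features(height, width):
    """Flattened feature count after conv1 -> pool -> conv2 -> pool in ConvNet."""
    dims = []
    for size in (height, width):
        size = conv_out(size, kernel=5)            # conv1: nn.Conv2d(3, 6, 5)
        size = conv_out(size, kernel=2, stride=2)  # pool: nn.MaxPool2d(2, 2)
        size = conv_out(size, kernel=5)            # conv2: nn.Conv2d(6, 16, 5)
        size = conv_out(size, kernel=2, stride=2)  # pool again
        dims.append(size)
    return 16 * dims[0] * dims[1]                  # 16 output channels from conv2


# Assumed frame size of 144x144; other sizes give a different count.
print(convnet_flat_features(144, 144))  # 17424
```

If the frames have a different size, `view(-1, 17424)` can either raise an error or silently fold several frames into the wrong batch rows, so the number is worth confirming against the real input.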

I tried the “sanity check” of feeding only one batch to see whether the loss goes to zero. It turns out the loss starts at 2.2 and only reaches around 1.8, even after 100 iterations. I am beginning to think there is something fundamentally wrong with the network. In each iteration, I make sure the training calls are made:

    #for batch_idx, (data, target) in enumerate(train_loader):
    output = video_model(data)
    loss = criterion(output, target)
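
For comparison, a single-batch overfitting check generally follows the pattern sketched below. Everything concrete here is a placeholder and not the setup from this post: the tiny `nn.Sequential` model, the random batch, the Adam optimizer, and the learning rate are all assumptions chosen only so the sketch runs on its own. The part that matters is the order of `zero_grad`, `backward`, and `step` in every iteration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: a tiny 9-class classifier and one fixed random batch.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 9))
data = torch.randn(8, 20)
target = torch.randint(0, 9, (8,))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)  # assumed optimizer and lr

model.train()
losses = []
for step in range(200):
    optimizer.zero_grad()             # clear gradients left over from the previous step
    output = model(data)              # forward pass on the same batch every time
    loss = criterion(output, target)
    loss.backward()                   # compute gradients
    optimizer.step()                  # update the weights
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

With the real network, `video_model` would take the place of `model`. Note that dropout stays active in train mode, which can keep the single-batch loss from settling at exactly zero.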

Could you please point out where I am going wrong? Thank you so much!