Combined CNN+LSTM Model Doesn't Learn During Training

Hello,

I’ve been trying to build a combined CNN+LSTM model to classify videos from an 11-class subset of the Kinetics-400 dataset for human action recognition. I use the PyTorchVideo framework for data preprocessing and the dataloader, and PyTorch Lightning for the model implementation. I train with cross-entropy loss and the Adam optimizer with an LR scheduler.
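
For reference, the preprocessing/dataloader pipeline looks roughly like this (a simplified sketch from memory; the path, clip duration, crop size, and normalization constants are placeholders rather than my exact values):

import torch
from pytorchvideo.data import Kinetics, make_clip_sampler
from pytorchvideo.transforms import ApplyTransformToKey, Normalize, ShortSideScale, UniformTemporalSubsample
from torchvision.transforms import CenterCrop, Compose, Lambda

# Applied only to the "video" entry of each sample dict
transform = ApplyTransformToKey(
    key="video",
    transform=Compose([
        UniformTemporalSubsample(8),   # subsample each clip to 8 frames
        Lambda(lambda x: x / 255.0),   # scale pixel values to [0, 1]
        Normalize((0.45, 0.45, 0.45), (0.225, 0.225, 0.225)),
        ShortSideScale(size=256),
        CenterCrop(256),
    ]),
)

train_dataset = Kinetics(
    data_path="kinetics11/train",                    # placeholder path
    clip_sampler=make_clip_sampler("random", 2.0),   # placeholder clip duration
    transform=transform,
    decode_audio=False,
)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=2, num_workers=4)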

The problem is that my model doesn’t actually learn: the loss has hovered between 2 and 6 since the start of training (for reference, chance-level cross-entropy with 11 classes is ln 11 ≈ 2.4). I assume this isn’t a data-preparation issue, since I have successfully trained a ResNet-50 on the exact same data with a normal-looking loss curve (it started around 20 and stabilized around 1 after a couple of epochs), but just to fill you in:

Each batch comes as a dictionary, and I extract what I need from it (the video, split into 8 frames, and the label).

The video initially has shape (B, C, L, H, W), and the label is a class index ranging from 0 to 10.
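
Concretely, the extraction is just (shapes shown for a batch size of 2):

x = batch["video"]   # float tensor, shape (B, C, L, H, W), e.g. (2, 3, 8, 256, 256)
y = batch["label"]   # long tensor, shape (B,), class indices in [0, 10]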

When I feed the model random data, I get practically the same results (loss hovers around 2-6, nothing is learned), so I assume the issue is in my implementation. Playing around with hyperparameters doesn’t change anything either. I’ve been working on this for a couple of weeks now and have come up with basically nothing, so any help getting the model to learn properly would be greatly appreciated.
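
For reference, the random-data check is essentially the commented-out line you’ll see in forward() below; as a standalone snippet (the random labels here are made up just so the snippet can compute a loss):

import torch
import torch.nn.functional as F

model = CNNLSTMModule().cuda()
x = torch.rand((2, 3, 8, 256, 256)).cuda()   # random "video" batch
y = torch.randint(0, 11, (2,)).cuda()        # random labels in [0, 10]
logits = model(x)                            # shape (2, 11)
print(F.cross_entropy(logits, y))            # sits around ln(11) ≈ 2.4, same as with real data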

Code for the CNN+LSTM model is below:

import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl
from torchmetrics import Accuracy
from torchvision.models import resnet101


class CNNLSTMModule(pl.LightningModule):
    def __init__(self, out_features=11, in_features=3):
        super().__init__()

        # Pretrained ResNet-101 backbone; its classifier head is replaced so it
        # emits a 256-d feature vector per frame
        self.conv_block = resnet101(pretrained=True)
        self.conv_block.fc = nn.Linear(self.conv_block.fc.in_features, 256)

        self.model = nn.LSTM(input_size=256, hidden_size=512, num_layers=1, batch_first=True)
        self.train_accuracy = Accuracy(task="multiclass", num_classes=out_features)
        self.val_accuracy = Accuracy(task="multiclass", num_classes=out_features)
        self.fc1 = nn.Linear(512, out_features)

        # Hidden state is stored on the module and fed back into the LSTM on
        # every forward call
        self.hidden = None

    def forward(self, x):
        # x = torch.rand((2, 3, 8, 256, 256)).cuda()  # the random data I fed, which gave the same results

        x = torch.transpose(x, 1, 2)  # (B, C, L, H, W) -> (B, L, C, H, W)
        
        batch_size = x.size(0)
        seq_len = x.size(1)

        x = x.reshape(batch_size * seq_len, *x.shape[2:])  # fold time into batch: (B*L, C, H, W)
        x = self.conv_block(x)                             # per-frame features: (B*L, 256)

        x = x.reshape(batch_size, seq_len, *x.shape[1:])   # unfold back to (B, L, 256)

        x, self.hidden = self.model(x, self.hidden)        # run the LSTM over the frame features
        x = self.fc1(x)                                    # per-timestep logits: (B, L, 11)
        x = x[:, -1]                                       # keep only the last timestep: (B, 11)

        return x

    def training_step(self, batch, batch_idx):

        x = batch["video"]

        y_hat = self.forward(x)

        loss = F.cross_entropy(y_hat, batch["label"])
        acc = self.train_accuracy(F.softmax(y_hat, dim=-1), batch["label"])   

        self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True)
        self.log(
            "train_acc", acc, on_step=True, on_epoch=True, prog_bar=True, sync_dist=True
        )

        return {"loss": loss}

    def validation_step(self, batch, batch_idx):

        x = batch["video"]

        y_hat = self.forward(x)

        loss = F.cross_entropy(y_hat, batch["label"])
        acc = self.val_accuracy(F.softmax(y_hat, dim=-1), batch["label"])

        self.log("val_loss", loss, on_step=True, on_epoch=True, prog_bar=True)
        self.log(
            "val_acc", acc, on_step=True, on_epoch=True, prog_bar=True, sync_dist=True
        )

        return loss

    def test_step(self, batch, batch_idx):
        return self.validation_step(batch, batch_idx)

    def configure_optimizers(self):  # exact same setup as used for the ResNet-50
        optimizer = torch.optim.Adam(
            self.parameters(),
            lr=1e-2,
        )

        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=100, last_epoch=-1
        )

        return [optimizer], [scheduler]
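
For completeness, I launch training with a standard Lightning Trainer, roughly like this (flags simplified):

import pytorch_lightning as pl

model = CNNLSTMModule()
trainer = pl.Trainer(max_epochs=100, accelerator="gpu", devices=1)
trainer.fit(model, train_loader, val_loader)   # the PyTorchVideo dataloaders described above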