Hello,

I’ve been trying to build a combined CNN+LSTM model to classify videos from an 11-class version of the Kinetics 400 dataset for human action recognition. I use the PyTorchVideo framework for data preprocessing and the dataloader, and PyTorch Lightning for the model implementation. I also use cross-entropy loss and the Adam optimizer with an LR scheduler.

The problem is that my model doesn’t actually learn: its loss has hovered between 2 and 6 since the start of training. I assume this isn’t a data preparation issue, as I have successfully trained a ResNet50 on the exact same data with a normal-looking loss curve (it started at around 20 and stabilized at around 1 after a couple of epochs), but just to fill you guys in:

Data comes as a dictionary, and I extract what I need from it (the video split up into 8 frames and the label).

The video initially has the shape (B, C, L, H, W), where L is the number of frames (8 here), and the label is a class index ranging from 0 to 10.
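To make the shapes concrete, here is a small standalone sketch of how a (B, C, L, H, W) batch can be folded into per-frame images and unfolded again. The tiny sizes and the names `clip`, `frames`, `features`, and `sequences` are made up for illustration, and the per-frame mean is only a stand-in for the CNN, not my actual model.

```python
import torch

# Made-up small sizes so this runs instantly: B clips, C channels, L frames.
B, C, L, H, W = 2, 3, 8, 4, 4
clip = torch.arange(B * C * L * H * W, dtype=torch.float32).reshape(B, C, L, H, W)

# (B, C, L, H, W) -> (B, L, C, H, W): put the time axis next to the batch axis.
clip_blchw = clip.transpose(1, 2)

# Fold time into batch so a 2D CNN sees one frame per row: (B*L, C, H, W).
frames = clip_blchw.reshape(B * L, C, H, W)

# Stand-in for the CNN: one feature vector per frame (here just channel means).
features = frames.mean(dim=(2, 3))  # (B*L, C)

# Unfold back into sequences for the LSTM: (B, L, feature_dim).
sequences = features.reshape(B, L, -1)

# Frame t of clip b should land at row b*L + t of the folded batch.
b, t = 1, 5
frame_is_aligned = torch.equal(frames[b * L + t], clip[b, :, t])
print(frames.shape, sequences.shape, frame_is_aligned)
```

If `frame_is_aligned` comes out true, folding and unfolding keeps each frame attached to its own clip and time step.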

When I feed the model random data, it gives practically the same results (the loss hovers between 2 and 6 and the model doesn’t learn anything), so I assume the issue is in my implementation. Playing around with hyperparameters also doesn’t change anything. I’ve been working on this issue for a couple of weeks now and have come up with basically no solutions, so any help getting the model to learn properly would be greatly appreciated.
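As a reference point for those numbers: a classifier that assigns equal probability to all 11 classes has a cross-entropy of ln(11), about 2.40, so a loss sitting in the 2-6 range is at or above chance level. A minimal sketch of that arithmetic using only the standard library (the variable names are mine):

```python
import math

num_classes = 11

# Uniform prediction: every class gets probability 1/11.
uniform_prob = 1.0 / num_classes

# Cross-entropy for one sample is -log(probability of the true class).
chance_loss = -math.log(uniform_prob)

print(f"chance-level cross-entropy for {num_classes} classes: {chance_loss:.4f}")
```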

Code for the CNN + LSTM model is below:

```
import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchmetrics import Accuracy
from torchvision.models import resnet101


class CNNLSTMModule(pl.LightningModule):
    def __init__(self, out_features=11, in_features=3):
        super().__init__()
        # Pretrained ResNet-101 as a per-frame feature extractor (256-dim output)
        self.conv_block = resnet101(pretrained=True)
        self.conv_block.fc = nn.Linear(self.conv_block.fc.in_features, 256)
        self.model = nn.LSTM(input_size=256, hidden_size=512, num_layers=1, batch_first=True)
        self.train_accuracy = Accuracy(task="multiclass", num_classes=11)
        self.fc1 = nn.Linear(512, 11)
        # LSTM state is stored on the module and reused on the next forward call
        self.hidden = None

    def forward(self, x):
        #x = torch.rand((2, 3, 8, 256, 256)).cuda() -- The random data I fed, which gave same results
        x = torch.transpose(x, 1, 2)  # Change X to shape of (B, L, C, H, W) from shape of (B, C, L, H, W)
        batch_size = x.size(0)
        seq_len = x.size(1)
        x = x.reshape(batch_size * seq_len, *x.shape[2:])  # Combine time and batch dimensions
        x = self.conv_block(x)
        x = x.reshape(batch_size, seq_len, *x.shape[1:])  # Separate time and batch dimensions
        x, self.hidden = self.model(x, self.hidden)
        x = self.fc1(x)
        x = x[:, -1]  # Keep only the prediction for the last time step
        return x

    def training_step(self, batch, batch_idx):
        x = batch["video"]
        y_hat = self.forward(x)
        loss = F.cross_entropy(y_hat, batch["label"])
        acc = self.train_accuracy(F.softmax(y_hat, dim=-1), batch["label"])
        self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True)
        self.log(
            "train_acc", acc, on_step=True, on_epoch=True, prog_bar=True, sync_dist=True
        )
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        x = batch["video"]
        y_hat = self.forward(x)
        loss = F.cross_entropy(y_hat, batch["label"])
        # Note: this reuses the training accuracy metric for validation
        acc = self.train_accuracy(F.softmax(y_hat, dim=-1), batch["label"])
        self.log("val_loss", loss, on_step=True, on_epoch=True, prog_bar=True)
        self.log(
            "val_acc", acc, on_step=True, on_epoch=True, prog_bar=True, sync_dist=True
        )
        return loss

    def test_step(self, batch, batch_idx):
        return self.validation_step(batch, batch_idx)

    def configure_optimizers(self):  # exact same as used for resnet50
        optimizer = torch.optim.Adam(
            self.parameters(),
            lr=1e-2,
        )
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, 100, last_epoch=-1
        )
        return [optimizer], [scheduler]
```
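
For anyone who wants to poke at the LSTM part in isolation, here is a minimal sketch with the CNN removed. It assumes each clip is an independent sequence, so the initial state is left unset (zeros) on every call instead of being carried over through `self.hidden` as in my module. That is an assumption for the sketch, not something I have verified fixes the training. The layer sizes match the module above, and `features` is random stand-in data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same sizes as the module above: 256-dim frame features, 512 hidden units.
lstm = nn.LSTM(input_size=256, hidden_size=512, num_layers=1, batch_first=True)
head = nn.Linear(512, 11)

# Random stand-in for the CNN output: 2 clips, 8 frames each.
features = torch.rand(2, 8, 256)

# Passing no initial state makes the LSTM start from zeros for this batch.
out, (h_n, c_n) = lstm(features)

# For a single-layer, unidirectional LSTM the last time step of `out`
# is the final hidden state, so either one can feed the classifier head.
last_step = out[:, -1]  # (2, 512)
same_as_final_hidden = torch.allclose(last_step, h_n[-1])

logits = head(last_step)  # (2, 11)
print(logits.shape, same_as_final_hidden)
```

The output shape should be (batch, 11), the same shape `F.cross_entropy` receives in my `training_step`.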