Is my model too complex?

I am designing a model to search for patterns in videos and trace how these patterns behave over time.
I’m using Conv3D + LSTM layers (the model is shown below).

The idea is to track patterns of different sizes over different periods of time.
Is my model architecture too complex for that task?
I am running out of memory when I try to train the model.

I’d really appreciate a professional look at my case.

    import torch.nn as nn
    import torch.optim as optim


    class VideoModel(nn.Module):

        def __init__(self, num_frames, num_channels, num_classes):
            super(VideoModel, self).__init__()
            self.conv1 = nn.Conv3d(num_channels, 64, kernel_size=3, stride=1, padding=1)
            self.conv2 = nn.Conv3d(64, 128, kernel_size=5, stride=1, padding=2)
            self.conv3 = nn.Conv3d(128, 256, kernel_size=10, stride=1, padding=5)
            self.conv4 = nn.Conv3d(256, 512, kernel_size=25, stride=1, padding=12)
            self.conv5 = nn.Conv3d(512, 1024, kernel_size=25, stride=1, padding=12)
            # NOTE: lstm1's input_size (512) has to equal the last dimension of the
            # reshaped conv output in forward(), which depends on the input size.
            self.lstm1 = nn.LSTM(512, 512, num_layers=2, batch_first=True, bidirectional=True)
            # A bidirectional LSTM with hidden size 512 outputs 2 * 512 = 1024 features,
            # so the layers after lstm1 take 1024 inputs.
            self.lstm2 = nn.LSTM(1024, 512, num_layers=2, batch_first=True, bidirectional=True)
            self.fc = nn.Linear(1024, num_classes)

        def forward(self, x):
            # x: (batch, channels, frames, height, width)
            x = self.conv1(x)
            x = self.conv2(x)
            x = self.conv3(x)
            x = self.conv4(x)
            x = self.conv5(x)
            # Flatten time and space: (batch, 1024, frames * height * width)
            x = x.view(x.size(0), x.size(1), -1)
            x, _ = self.lstm1(x)
            x, _ = self.lstm2(x)
            x = self.fc(x[:, -1, :])  # use the last step of the LSTM output
            return x


    learning_rate = 1e-3  # placeholder; the value was not given in the post
    model = VideoModel(num_frames=151, num_channels=3, num_classes=1)
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

We all are, otherwise we’d not be making good use of those pricey GPUs. :wink:

More seriously:

Four... no, five thoughts (I started out at two; amongst my thoughts are such diverse elements as…):

  • You’re not using activation functions etc. What’s up with that? You should probably take a good look at a tried and tested model (e.g. ResNet) for the basics.
  • One typical thing is to reduce the resolution as you go through a stack of convs (with strides or pooling). You might even try 2D convs, temporarily treating the time dimension as another “batch” dimension. Both should help with memory.
  • 3D input is large! Depending on your current resolution, you could downscale the videos before feeding them into the network.
  • When you have something vaguely feasible, one typical sanity check is to try to train the model on a single batch, which the model should be able to overfit, i.e. get the error very small. If it cannot, your model is too rigid.
  • For training LSTMs, a standard thing to do is backpropagation through time (BPTT), a key element of which is truncating the sequence in time so that you get by with finite memory.
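
To make a few of these points concrete, here is a minimal PyTorch sketch. It is not the original poster's model and not a tuned architecture. It adds activations, downsamples with strided 2D convs applied frame by frame (time folded into the batch dimension), feeds the per-frame features to an LSTM, and then runs the single-batch overfitting check. The names `FrameEncoderLSTM` and `overfit_single_batch`, all layer sizes, and the tiny random input are made up for illustration. Truncated BPTT is not shown here.

```python
import torch
import torch.nn as nn


class FrameEncoderLSTM(nn.Module):
    """Hypothetical sketch: 2D convs per frame, then an LSTM over time."""

    def __init__(self, num_channels=3, hidden=64, num_outputs=1):
        super().__init__()
        # Each conv is followed by a nonlinearity; stride=2 halves H and W.
        self.encoder = nn.Sequential(
            nn.Conv2d(num_channels, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse the remaining H x W to 1 x 1
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_outputs)

    def forward(self, x):
        # x: (batch, channels, frames, height, width), same layout as nn.Conv3d
        b, c, t, h, w = x.shape
        # Fold time into the batch dimension so the 2D convs see single frames.
        frames = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        feats = self.encoder(frames).reshape(b, t, -1)  # (batch, frames, 32)
        out, _ = self.lstm(feats)
        return self.fc(out[:, -1, :])  # prediction from the last time step


def overfit_single_batch(model, x, y, steps=200, lr=1e-2):
    """Train repeatedly on one fixed batch and record the loss at each step."""
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    losses = []
    for _ in range(steps):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses


torch.manual_seed(0)
model = FrameEncoderLSTM()
x = torch.randn(2, 3, 8, 32, 32)  # tiny made-up clip: 2 videos, 8 frames each
y = torch.randn(2, 1)             # made-up regression targets
losses = overfit_single_batch(model, x, y)
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

If the last loss does not end up well below the first one on a single fixed batch, that usually points to a problem in the model or the training loop rather than in the data.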

In the olden days, a common thing to do was character recognition on text lines with conv + LSTM. You might check out some of those models and draw inspiration from them.

Best regards


P.S.: I think it might be worth looking for some more in-depth advice; the details of the above seem to be a bit beyond what’s usually covered on the forums.

The kernel sizes you’re using seem a bit large. Typically, you want kernels of size 3 or 5. If you’re going to experiment with larger kernels, I suggest doing these in split (parallel) branches.
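
A rough parameter count shows how much the large kernels cost. This is plain Python arithmetic, assuming the conv layers exactly as posted in the question (dense `nn.Conv3d` with bias and cubic kernels); the helper name `conv3d_params` is just for illustration.

```python
def conv3d_params(in_ch, out_ch, k):
    """Weights (out_ch * in_ch * k^3) plus one bias per output channel."""
    return out_ch * in_ch * k ** 3 + out_ch


# (in_channels, out_channels, kernel_size) as posted in the question,
# with num_channels=3 for the first layer.
layers = {
    "conv1": (3, 64, 3),
    "conv2": (64, 128, 5),
    "conv3": (128, 256, 10),
    "conv4": (256, 512, 25),
    "conv5": (512, 1024, 25),
}

counts = {name: conv3d_params(*cfg) for name, cfg in layers.items()}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name}: {n:,} parameters")
# float32 uses 4 bytes per parameter.
print(f"total: {total:,} parameters, about {total * 4 / 1e9:.1f} GB in float32")
```

With these numbers, conv5 alone has about 8.2 billion weights, and the five conv layers together come to roughly 10 billion parameters. That is around 41 GB in float32 before counting gradients, optimizer state, or activations.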

Additionally, for classification-type tasks, it is good to perform max pooling between layers. It acts in tandem with the kernels to filter out the right information for a classification task.

Lastly, batchnorm can be good to include for better stability.
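
As a sketch of how pooling and batchnorm could slot into a conv stack, here is a small PyTorch example. It is only an illustration with made-up channel counts and input size, not a tuned architecture, and `conv_block` is a hypothetical helper.

```python
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    """Conv3d -> BatchNorm3d -> ReLU -> MaxPool3d, halving T, H and W."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(),
        nn.MaxPool3d(kernel_size=2),
    )


features = nn.Sequential(conv_block(3, 8), conv_block(8, 16))

x = torch.randn(2, 3, 16, 32, 32)  # made-up input: (batch, C, T, H, W)
out = features(x)
print(tuple(x.shape), "->", tuple(out.shape))
```

Each block halves the time, height and width dimensions, so two blocks shrink the feature map that later layers have to hold in memory by a factor of 64 per channel.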