Hello everyone,
I know this is a rather general question, but maybe someone can point me in the right direction.
I have a dataset containing pictures from two different cameras.
In total I have 30 different labels (classes).
Each “timestep” (the input I want to give to my network) contains 24 frames from each camera and has one of those 30 labels.
In total there are around 30000 of those 24-frame-“timesteps”.

The classic approach for image classification is usually some CNN, but this one is maybe closer to some sort of video classification. Due to the limited number of classes I don’t need any huge architecture, but I am not sure how to approach this. I saw that there is something called “LSTM” that takes temporal dependence into account, but I am not sure how this would work with input from two different cameras.
All sorts of help is appreciated.

I would still recommend starting with a convolutional neural network, but
with three-dimensional convolutions.

Think of each timestep as being a 3d image with shape [24, height, width]
(and two channels – one for each of your two cameras). Therefore the
input to your model – including the batch dimension – will have shape [nBatch, 2, 24, height, width].
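
As a minimal sketch of how that input could be assembled, assuming each camera delivers single-channel (grayscale) frames, the two clips can be stacked along a new channel dimension. The sizes and the names `cam_a` and `cam_b` are made up for illustration; with RGB frames the channel count would differ.

```python
import torch

# Hypothetical sizes, chosen only for illustration.
n_batch, n_frames, height, width = 4, 24, 64, 64

# One grayscale clip per camera: [nBatch, 24, height, width]
cam_a = torch.randn(n_batch, n_frames, height, width)
cam_b = torch.randn(n_batch, n_frames, height, width)

# Stack the two cameras along a new channel dimension (dim=1),
# giving [nBatch, 2, 24, height, width].
x = torch.stack([cam_a, cam_b], dim=1)
print(x.shape)  # torch.Size([4, 2, 24, 64, 64])
```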

So you have three options: look for an off-the-shelf 3d CNN-based
classification model; start with an off-the-shelf 2d classification model
and convert it to a 3d model by replacing its Conv2ds with Conv3ds (along
with the matching pooling and normalization layers); or build your own
3d CNN-based classification model by using a series of Conv3d layers
followed by a couple of fully-connected Linear layers.
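
To make the "build your own" option concrete, here is a rough sketch of a small 3d CNN. The layer widths, kernel sizes, and the class name `Small3dCNN` are arbitrary choices of mine, not a tuned or recommended architecture, and the 64x64 frame size is again just an assumption.

```python
import torch
import torch.nn as nn


class Small3dCNN(nn.Module):
    """Illustrative 3d CNN: a few Conv3d layers, then two Linear layers."""

    def __init__(self, n_cameras=2, n_classes=30):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(n_cameras, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),          # halves frames, height and width
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # collapse to [nBatch, 64, 1, 1, 1]
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        # x: [nBatch, 2, 24, height, width] -> logits: [nBatch, 30]
        return self.classifier(self.features(x))


model = Small3dCNN()
x = torch.randn(4, 2, 24, 64, 64)
logits = model(x)
print(logits.shape)  # torch.Size([4, 30])
```

The logits can be passed to `nn.CrossEntropyLoss` together with integer class labels in the range 0 to 29. The adaptive pooling layer means the same model accepts other frame sizes without changing the Linear layers.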

Note that that third dimension – even if only of size 24 – will require more
memory than an otherwise similar 2d model.
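
A back-of-the-envelope comparison shows where the extra memory goes. This assumes the 2d model processes one frame at a time, and it looks at a single 3x3 (versus 3x3x3) convolution layer with made-up channel counts and frame size.

```python
# Hypothetical layer and frame sizes, for illustration only.
in_channels, out_channels = 2, 16
n_frames, height, width = 24, 64, 64

# Output activations per sample (with padding that preserves the size).
acts_2d = out_channels * height * width
acts_3d = out_channels * n_frames * height * width

# Kernel weights (ignoring biases).
weights_2d = in_channels * out_channels * 3 * 3
weights_3d = in_channels * out_channels * 3 * 3 * 3

print(acts_3d // acts_2d)        # 24
print(weights_3d // weights_2d)  # 3
```

The activations dominate: the first layer stores 24 times as many values per sample, while the kernel is only 3 times larger. Pooling along the frame dimension shrinks that factor in later layers.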