Does my approach make sense? CNN-LSTM


I am working with sequences for which I don't have sufficient data. I aim to train a model to perform binary classification on 30s-long sequences; however, I do have plenty of 10s-long sequences.
As a result, I used scalograms to train a CNN, which performed quite well on the 10s data.
Then I divided the 30s data into 3×10s segments and extracted features using the trained CNN. Keeping the CNN parameters frozen, I trained the LSTM of the CNN-LSTM architecture.
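Roughly, the setup looks like this (a minimal sketch; sizes and module names are illustrative assumptions, with a dummy CNN standing in for the trained scalogram network):

```python
import torch
import torch.nn as nn

class CNNLSTMClassifier(nn.Module):
    def __init__(self, cnn: nn.Module, feat_dim: int = 512, hidden: int = 128):
        super().__init__()
        self.cnn = cnn                      # CNN trained on 10s scalograms
        for p in self.cnn.parameters():     # keep the feature extractor frozen
            p.requires_grad = False
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)    # single binary logit

    def forward(self, x):                   # x: (batch, 3, C, H, W) – three 10s segments
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1))   # run each segment through the CNN: (b*t, feat_dim)
        feats = feats.view(b, t, -1)        # regroup into a length-3 sequence per sample
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])        # classify from the last LSTM timestep

# Dummy CNN standing in for the trained scalogram network
dummy_cnn = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(8, 512))
model = CNNLSTMClassifier(dummy_cnn)
logits = model(torch.randn(4, 3, 1, 64, 64))
print(logits.shape)  # torch.Size([4, 1])
```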

Does it make sense to expect this to perform well?
My intuition is that since I am extracting features from the 30s data using the CNN trained on the 10s sequences, I will be able to train the LSTM and then use the CNN-LSTM model to classify the 30s sequences.
I couldn't find any reference for this with 1-D sequences (the idea was inspired by Human Action Recognition projects).

PS: when I trained it once, it performed poorly. Then I retrained it, and although the training loss and accuracy were approximately constant, the validation/testing accuracy improved a lot (mean accuracy went from 15% to 83%) and the loss decreased slightly (0.69 to 0.51). This is binary classification.

Thank you.

There's no principled reason not to train a pure LSTM on 10s inputs and predict the output for 30s inputs.
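A minimal sketch of such a baseline (layer sizes are illustrative assumptions): because an LSTM accepts any sequence length, the same model can be trained on 10s inputs and evaluated on 30s inputs.

```python
import torch
import torch.nn as nn

class LSTMBaseline(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden, batch_first=True)  # raw 1-D signal in
        self.head = nn.Linear(hidden, 1)                  # binary logit out

    def forward(self, x):               # x: (batch, length, 1) – any length
        out, _ = self.lstm(x)
        return self.head(out[:, -1])    # classify from the last timestep

model = LSTMBaseline()
# The same model handles both clip lengths:
print(model(torch.randn(4, 1000, 1)).shape)  # e.g. 10s input -> torch.Size([4, 1])
print(model(torch.randn(4, 3000, 1)).shape)  # e.g. 30s input -> torch.Size([4, 1])
```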

Have you tried such a baseline model to see how it performs?

Do you mean 30s video sequences?

If so, have you considered just taking the baseline CNN outputs and sending them through self attention before the final output layer?

Were you having size issues with the 30s vs 10s?

I am working with ECGs, and unfortunately I have very few well-labelled 30s ECGs and plenty of well-labelled 10s ECGs.
Does self-attention allow for variable-size inputs?

You could pad the 10s clips with zeros on the front and back so that two-thirds of what is sent into the model is zeros. Then randomly mask the front and back of the 30s clips so that two-thirds of those also go in as zeros, making the input size the same as the 10s clips. It would have the same effect as masking words for NLP models or blocking parts of an image for image classification models.
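A rough sketch of that augmentation, assuming raw 1-D signals at an illustrative sampling rate of 100 Hz (the rate and function names are assumptions):

```python
import numpy as np

fs = 100  # assumed sampling rate in Hz

def pad_10s_to_30s(clip_10s, rng):
    """Zero-pad a 10s clip to 30s, placing the clip at a random offset."""
    out = np.zeros(30 * fs, dtype=clip_10s.dtype)
    start = rng.integers(0, 20 * fs + 1)
    out[start:start + 10 * fs] = clip_10s
    return out

def mask_30s_to_10s_window(clip_30s, rng):
    """Zero everything except one random 10s window of a 30s clip."""
    out = np.zeros_like(clip_30s)
    start = rng.integers(0, 20 * fs + 1)
    out[start:start + 10 * fs] = clip_30s[start:start + 10 * fs]
    return out

rng = np.random.default_rng(0)
padded = pad_10s_to_30s(np.ones(10 * fs), rng)
masked = mask_30s_to_10s_window(np.ones(30 * fs), rng)
print(padded.shape, padded.sum())   # (3000,) 1000.0
print(masked.sum())                 # 1000.0 – only a 10s window survives
```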

Self-attention doesn't change the input/output size. It just helps the model learn to selectively focus on important features and filter out the noise.

The size of the inputs should reflect what size they would normally be in real world use.

Thank you I will try it.
However, I don't know how well it will perform, because the disease I am trying to detect might only be detectable for a very short period of time, so masking too much of the 30s clip might not be ideal.

You only mask/augment during training. For testing/validation, you should use the entire 30s clips.

Thank you, I will test it and let you know!

Hello, I have been working on some other tests, and now I will work on the self-attention. Since I have never worked with attention before: should I use a CBAM just before the last layer that performs the binary classification, given that I am working with a CNN?
I read that the multi-head attention module in PyTorch is for sequences (such as NLP), and I assume the extracted features cannot be treated as sequences.

Depends on the CNN dims. For Conv1D or Conv2D, you could likely adapt the AttentionBlock found here:

For Conv3D and greater, the dot product can get too large. So you can make use of Efficient Attention, which is here:

But you’ll need to adapt it for the dims.
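As a rough illustration of the general pattern only (not the linked code itself), a scaled dot-product self-attention block over a Conv2d feature map might look like this; note it returns the same shape it receives:

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Illustrative scaled dot-product self-attention for (B, C, H, W) maps."""
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)  # 1x1 convs as projections
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.scale = channels ** -0.5

    def forward(self, x):                             # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)      # (B, HW, C)
        k = self.k(x).flatten(2)                      # (B, C, HW)
        v = self.v(x).flatten(2).transpose(1, 2)      # (B, HW, C)
        attn = torch.softmax(q @ k * self.scale, -1)  # (B, HW, HW) – grows fast with dims
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + out                                # residual: same size out as in

x = torch.randn(2, 16, 8, 8)
y = SelfAttention2d(16)(x)
print(y.shape)  # torch.Size([2, 16, 8, 8])
```

The (HW × HW) attention matrix is why the dot product gets too large for Conv3D and beyond, which is where efficient-attention variants come in.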

I am working with resnet18, so from what you suggest, I should use the ConvBlock just after the AdaptiveAvgPool2d(output_size=(1, 1)) and before the last fully connected layer, correct?
Also, do I need only the AttentionBlock class, or the other classes as well?

There are multiple modules listed on that page. It is NOT the ConvBlock. Scroll down to line 44, class AttentionBlock.

To see how it’s used, I suggest having a look at the UNet architecture in that link. It’s used at several junctures.

Basically, any time you want to help the model focus its attention, you can call that module.

It will give you the same size out as what you put into it.

My apologies. That particular example is applied at the junctures of the skip connection and the main path. Here is another attention module that takes just one input; it's called LinearAttention, on line 211:

I see, and should I just have it after the feature-extraction layer?
Also, I found this as well Attention in image classification - #3 by AdilZouitine

That sounds like a good idea. Keep in mind, focus implies a wide range of view to choose from.

Do you mean that the self-attention might differ depending on the scenario?

I'm not clear on your question. I just mean that attention is best applied when there is a lot of data involved. It likely wouldn't do anything for your final binary classification output of size 1, so earlier in the forward pass would be more appropriate.