Does my approach make sense? CNN-LSTM


I am working with sequences for which I don't have sufficient data. I aim to train a model to perform binary classification on 30s-long sequences; however, I do have plenty of 10s-long sequences.
As a result, I used scalograms to train a CNN, which performed quite well on the 10s data.
Then I divided the 30s data into 3×10s segments and extracted features using the trained CNN. Keeping the CNN parameters frozen, I trained the LSTM of the CNN-LSTM architecture.
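Roughly, the setup looks like this (a minimal sketch; sizes and module names are illustrative assumptions, with a dummy CNN standing in for the trained scalogram network):

```python
import torch
import torch.nn as nn

class CNNLSTMClassifier(nn.Module):
    def __init__(self, cnn: nn.Module, feat_dim: int = 512, hidden: int = 128):
        super().__init__()
        self.cnn = cnn                      # CNN trained on 10s scalograms
        for p in self.cnn.parameters():     # keep the feature extractor frozen
            p.requires_grad = False
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)    # single binary logit

    def forward(self, x):                   # x: (batch, 3, C, H, W) – three 10s segments
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1))   # run each segment through the CNN: (b*t, feat_dim)
        feats = feats.view(b, t, -1)        # regroup into a length-3 sequence per sample
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])        # classify from the last LSTM timestep

# Dummy CNN standing in for the trained scalogram network
dummy_cnn = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(8, 512))
model = CNNLSTMClassifier(dummy_cnn)
logits = model(torch.randn(4, 3, 1, 64, 64))
print(logits.shape)  # torch.Size([4, 1])
```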

Does it make sense to expect this to perform well?
My intuition is that since I am extracting features from the 30s data using the CNN trained on the 10s sequences, I will be able to train the LSTM and then use the CNN-LSTM model to classify the 30s sequences.
I couldn't find any reference for this with 1-D sequences (the idea was inspired by Human Action Recognition projects).

PS: when I trained it once, it performed poorly. Then I retrained it, and although the training loss and accuracy were approximately constant, the validation/testing accuracy improved a lot (mean accuracy went from 15% to 83%) and the loss decreased slightly (0.69 to 0.51). This is binary classification.

Thank you.

There's no principled reason not to train a pure LSTM on 10s inputs and predict the output for 30s inputs.
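A minimal sketch of such a baseline (layer sizes are illustrative assumptions): because an LSTM accepts any sequence length, the same model can be trained on 10s inputs and evaluated on 30s inputs.

```python
import torch
import torch.nn as nn

class LSTMBaseline(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden, batch_first=True)  # raw 1-D signal in
        self.head = nn.Linear(hidden, 1)                  # binary logit out

    def forward(self, x):               # x: (batch, length, 1) – any length
        out, _ = self.lstm(x)
        return self.head(out[:, -1])    # classify from the last timestep

model = LSTMBaseline()
# The same model handles both clip lengths:
print(model(torch.randn(4, 1000, 1)).shape)  # e.g. 10s input -> torch.Size([4, 1])
print(model(torch.randn(4, 3000, 1)).shape)  # e.g. 30s input -> torch.Size([4, 1])
```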

Have you tried such a baseline model to see how it performs?

Do you mean 30s video sequences?

If so, have you considered just taking the baseline CNN outputs and sending them through self attention before the final output layer?

Were you having size issues with the 30s vs 10s?

I am working with ECGs, and unfortunately I have very few well-labelled 30s ECGs and plenty of well-labelled 10s ECGs.
Does self-attention allow for variable-size inputs?

You could pad the 10s clips with zeros on the front and back so that two-thirds of what is sent into the model is zeros. Then randomly mask the front and back of the 30s clips so that two-thirds of those also go in as zeros, making the input size the same as the 10s clips. It would have the same effect as masking words for NLP models or blocking parts of an image for image classification models.
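A rough sketch of that augmentation, assuming raw 1-D signals at an illustrative sampling rate of 100 Hz (the rate and function names are assumptions):

```python
import numpy as np

fs = 100  # assumed sampling rate in Hz

def pad_10s_to_30s(clip_10s, rng):
    """Zero-pad a 10s clip to 30s, placing the clip at a random offset."""
    out = np.zeros(30 * fs, dtype=clip_10s.dtype)
    start = rng.integers(0, 20 * fs + 1)
    out[start:start + 10 * fs] = clip_10s
    return out

def mask_30s_to_10s_window(clip_30s, rng):
    """Zero everything except one random 10s window of a 30s clip."""
    out = np.zeros_like(clip_30s)
    start = rng.integers(0, 20 * fs + 1)
    out[start:start + 10 * fs] = clip_30s[start:start + 10 * fs]
    return out

rng = np.random.default_rng(0)
padded = pad_10s_to_30s(np.ones(10 * fs), rng)
masked = mask_30s_to_10s_window(np.ones(30 * fs), rng)
print(padded.shape, padded.sum())   # (3000,) 1000.0
print(masked.sum())                 # 1000.0 – only a 10s window survives
```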

Self-attention doesn't change the input/output size. It just helps the model learn to selectively focus on important features and filter out the noise.

The size of the inputs should reflect what size they would normally be in real world use.

Thank you I will try it.
However, I don't know how well it will perform, because the disease I am trying to detect might only be detectable for a very short period of time, so masking too much of the 30s clip might not be ideal.

You only mask/augment during training. For testing/validation, you should use the entire 30s clips.

Thank you, I will test it and let you know!

Hello, I have been working on some other tests, and now I will work on the self-attention. Since I have never worked with attention before: should I use a CBAM just before the last layer that performs the binary classification, given that I am working with a CNN?
I read that the multi-head attention module in PyTorch is for sequences (such as NLP), and I assume the extracted features cannot be treated as sequences.

Depends on the CNN dims. For Conv1D or Conv2D, you could likely adapt the AttentionBlock found here:

For Conv3D and greater, the dot product can get too large. So you can make use of Efficient Attention, which is here:

But you’ll need to adapt it for the dims.
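As a rough illustration of the general pattern only (not the linked code itself), a scaled dot-product self-attention block over a Conv2d feature map might look like this; note it returns the same shape it receives:

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Illustrative scaled dot-product self-attention for (B, C, H, W) maps."""
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)  # 1x1 convs as projections
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.scale = channels ** -0.5

    def forward(self, x):                             # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)      # (B, HW, C)
        k = self.k(x).flatten(2)                      # (B, C, HW)
        v = self.v(x).flatten(2).transpose(1, 2)      # (B, HW, C)
        attn = torch.softmax(q @ k * self.scale, -1)  # (B, HW, HW) – grows fast with dims
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + out                                # residual: same size out as in

x = torch.randn(2, 16, 8, 8)
y = SelfAttention2d(16)(x)
print(y.shape)  # torch.Size([2, 16, 8, 8])
```

The (HW × HW) attention matrix is why the dot product gets too large for Conv3D and beyond, which is where efficient-attention variants come in.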

I am working with resnet18, so from what you suggest, I should use the ConvBlock just after the AdaptiveAvgPool2d(output_size=(1, 1)) and before the last fully connected layer, correct?
Also, do I need only the AttentionBlock class, or the other classes as well?

There are multiple modules listed on that page. It is NOT the ConvBlock. Scroll down to line 44, class AttentionBlock.

To see how it’s used, I suggest having a look at the UNet architecture in that link. It’s used at several junctures.

Basically, any time you want to help the model focus its attention, you can call that module.

It will give you the same size out as what you put into it.

My apologies. That particular example is applied at the junctures of the skip connection and the main path. Here is another attention module that takes just one input; it's called LinearAttention, on line 211:

I see, and should I just have it after the feature-extraction layer?
Also, I found this as well Attention in image classification - #3 by AdilZouitine

That sounds like a good idea. Keep in mind, focus implies a wide range of view to choose from.

Do you mean that the self-attention might differ depending on the scenario?

I'm not clear on your question. I just mean that attention is best applied when there is a lot of data involved. It likely wouldn't do anything for your final binary classification output of size 1, so earlier in the forward pass would be more appropriate.