LSTM with 3D CNN for activity recognition

Hi friends. I'd like to recognize activity in video data using Conv3D + LSTM.

Just for testing, I coded:

import torch
import torch.nn as nn

conv1 = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
pool1 = nn.MaxPool3d(kernel_size=2)

conv2 = nn.Conv3d(in_channels=64, out_channels=32, kernel_size=3, padding=1)
pool2 = nn.MaxPool3d(kernel_size=2)

# assume we have two separate videos
vid1 = torch.rand(1, 3, 25, 220, 240) # (N_samples, Ch, Frame, H, W)
vid2 = torch.rand(1, 3, 30, 220, 240)

x1 = conv1(vid1) # 25 frames to be processed
x1 = pool1(x1)
x1 = conv2(x1)
x1 = pool2(x1) # output shape torch.Size([1, 32, 6, 55, 60]) - 32 channels, 6 frames

x2 = conv1(vid2) # 30 frames to be processed
x2 = pool1(x2)
x2 = conv2(x2)
x2 = pool2(x2) # output shape torch.Size([1, 32, 7, 55, 60]) - 32 channels, 7 frames
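Before an LSTM can consume this output, the (N, C, T, H, W) tensor has to be turned into a (batch, time, features) sequence. A minimal sketch of that reshape, assuming the pooled shape above (the tensor here is random, standing in for the Conv3D output):

```python
import torch

# stand-in for the pooled Conv3D output: (N, C, T, H, W)
feat = torch.rand(1, 32, 6, 55, 60)

# move time to dim 1, then flatten each frame's (C, H, W) features:
# (N, C, T, H, W) -> (N, T, C, H, W) -> (N, T, C*H*W)
seq = feat.permute(0, 2, 1, 3, 4).flatten(start_dim=2)
print(seq.shape)  # torch.Size([1, 6, 105600])
```

So the LSTM would see a sequence of 6 time steps, each with 32 * 55 * 60 = 105600 features.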

After reading the LSTM docs, I see it has two basic parameters, input_size and hidden_size.

But I can't figure out:

  1. How do I set the LSTM parameters correctly?
    input_size = 32 channels * 6 frames * H * W?
    hidden_size = what value? Only intuition, e.g. 100?

Or is there some relation between input_size and hidden_size?

  2. And I have another issue: does the number of frames have to be equal for all inputs to the LSTM (for it to work over time)?

Can someone help me set up a basic example? Thanks.

  1. There isn't a necessary relation between input_size and hidden_size. We usually set hidden_size to numbers like 32, 64, or 128 (you can try which value works better).
  2. The lengths must be the same within a batch, but different batches don't have to match. You can, for example, pad inputs of different lengths to a certain max_length.
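To make point 1 concrete, here is a sketch of wiring the pooled features into an LSTM. The input_size follows from the flattened per-frame features (32 * 55 * 60 for the shapes in the question); hidden_size = 128 is just one of the commonly tried values, not a prescribed choice:

```python
import torch
import torch.nn as nn

C, H, W = 32, 55, 60           # channels and spatial size after pooling
input_size = C * H * W         # features per time step (105600)
hidden_size = 128              # free hyperparameter; 32/64/128 are common starts

lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size, batch_first=True)

x = torch.rand(1, 6, input_size)   # (batch, time, features) - 6 frames
out, (h_n, c_n) = lstm(x)
print(out.shape)  # torch.Size([1, 6, 128]) - one output per frame
print(h_n.shape)  # torch.Size([1, 1, 128]) - final hidden state
```

For classification you would typically feed h_n (or the last step of out) into a linear layer.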

Thank you sir, I understood item 1.

Regarding item 2, it is still unclear to me. I see that the batch should be (B, C, n_frames, H, W) with n_frames equal for all samples.

By padding, do you mean that a short sequence can be completed with frames of zeros until it reaches the desired n_frames, equal for all sequences? Please give a little feedback on this point.

Yes, pad the short sequences to a longer length. But you don't have to pad to the global maximum; you can pad only to the maximum length within the batch. There are many ways to pad besides zero-padding; you can experiment with that.

OK, thank you. I will work on it.