Hi friends. I would like to recognize activity in video data using Conv3D + LSTM.
Just for testing, I coded:
import torch
import torch.nn as nn

conv1 = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
pool1 = nn.MaxPool3d(kernel_size=2)
conv2 = nn.Conv3d(in_channels=64, out_channels=32, kernel_size=3, padding=1)
pool2 = nn.MaxPool3d(kernel_size=2)
# assume we have two separate videos
vid1 = torch.rand(1, 3, 25, 220, 240) # (N_samples, Ch, Frame, H, W)
vid2 = torch.rand(1, 3, 30, 220, 240)
x1 = conv1(vid1) # 25 Frames to be processed
x1 = pool1(x1)
x1 = conv2(x1)
x1 = pool2(x1) # output shape torch.Size([1, 32, 6, 55, 60]) - 32 channels, 6 Frames
x2 = conv1(vid2) # 30 Frames to be processed
x2 = pool1(x2)
x2 = conv2(x2)
x2 = pool2(x2) # output shape torch.Size([1, 32, 7, 55, 60]) - 32 channels, 7 Frames
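To make the question concrete, here is a minimal sketch of one way I imagine feeding the pooled features to an LSTM: treating each frame as one timestep and flattening channels * H * W into the feature vector (hidden_size=100 is just an arbitrary guess):

```python
import torch
import torch.nn as nn

# pooled conv output from above: (N, C, T, H, W)
feat = torch.rand(1, 32, 6, 55, 60)
N, C, T, H, W = feat.shape

# move time to dim 1 and flatten each frame: (N, T, C*H*W)
seq = feat.permute(0, 2, 1, 3, 4).reshape(N, T, C * H * W)

# input_size = per-frame feature size; hidden_size = 100 is arbitrary
lstm = nn.LSTM(input_size=C * H * W, hidden_size=100, batch_first=True)
out, (h_n, c_n) = lstm(seq)
print(out.shape)  # torch.Size([1, 6, 100]) - one hidden state per frame
print(h_n.shape)  # torch.Size([1, 1, 100]) - final hidden state
```

(No idea if this flattening is the right approach, or if 32*55*60 = 105600 is too large an input_size in practice.)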
After reading the LSTM docs, I see it has two basic parameters (input_size and hidden_size), but I can't figure out:
- How do I set up the LSTM parameters correctly? Is input_size = 32 channels * 6 frames * H * W? And for hidden_size, is it just intuition, e.g. 100, or is there some relation between the two?
- Another issue: does the number of frames have to be equal for the input to the LSTM (so it can work over time)?
Can someone help me set up a basic example? Thanks.
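For the second question, here is a sketch of what I tried with padding/packing, assuming the per-frame feature size 105600 = 32*55*60 from the pooled output above and an arbitrary hidden_size of 100. As far as I understand, pad_sequence + pack_padded_sequence lets the LSTM handle the 6-frame and 7-frame clips in one batch:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import (pad_sequence, pack_padded_sequence,
                                pad_packed_sequence)

# flattened per-frame features for the two clips (frames differ: 6 vs 7)
seq1 = torch.rand(6, 105600)   # vid1 after conv/pool, flattened per frame
seq2 = torch.rand(7, 105600)   # vid2 after conv/pool
lengths = torch.tensor([6, 7])

# pad to the longest clip, then pack so the LSTM ignores the padding
padded = pad_sequence([seq1, seq2], batch_first=True)  # (2, 7, 105600)
packed = pack_padded_sequence(padded, lengths,
                              batch_first=True, enforce_sorted=False)

lstm = nn.LSTM(input_size=105600, hidden_size=100, batch_first=True)
out_packed, (h_n, c_n) = lstm(packed)
out, out_lens = pad_packed_sequence(out_packed, batch_first=True)
print(out.shape)   # torch.Size([2, 7, 100])
print(h_n.shape)   # torch.Size([1, 2, 100]) - last valid state per clip
```

Is this the intended way, or should I force both clips to the same number of frames before the Conv3D?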