I want to do classification on a sequence of images. Essentially, the local features of each frame in the sequence, obtained through convolution layers, should go into an RNN layer, for which I am using nn.RNN(). My input has the shape (samples, channels, timesteps, height, width). I don't yet have a good intuition for 3D convolutions; because I need to preserve the timesteps dimension, I am using a kernel size of 1 along that dimension. So far, I have the following setup:
```python
import torch.nn as nn
import torch.optim as optim

# input: (batchsize, channels, timesteps, height, width)
layer1 = nn.Sequential(
    nn.Conv3d(1, 8, (1, 5, 5)),
    nn.ReLU(),
    nn.MaxPool3d((1, 2, 2), (1, 2, 2)),
)
layer2 = nn.Sequential(
    nn.Conv3d(8, 16, (1, 3, 3)),
    nn.ReLU(),
    nn.MaxPool3d((1, 2, 2), (1, 2, 2)),
)
layer3 = ...
# output: (batchsize, channels', timesteps, height', width')
# features: channels' x height' x width'

# I am combining the channels', height' and width' dimensions as features
# for the RNN, moving timesteps ahead of them first
output = output.permute(0, 2, 1, 3, 4).contiguous()
output = output.view(output.size(0), output.size(1), -1)
n_features = output.size(2)
# (batchsize, timesteps, features)
rnn = nn.RNN(
    input_size=n_features,
    hidden_size=100,
    num_layers=1,
    batch_first=True,
)

# taking the output from the last timestep of the RNN as features
# for a Linear layer
output, _ = rnn(output)
output = output[:, -1, :]
n_features = output.size(-1)
fc = nn.Linear(n_features, n_classes)

# loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01, betas=(0.9, 0.999),
                       eps=1e-8, weight_decay=1e-6)
```
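To make the shapes concrete, here is a self-contained sketch of how I understand these pieces fitting together in one forward pass. The frame size (64x64), clip length, and n_classes value are made-up placeholders, not my real data; the point is that the (1, k, k) kernels and (1, 2, 2) pools leave the timesteps dimension untouched, and that timesteps is permuted ahead of the feature dimensions before flattening for the RNN:

```python
import torch
import torch.nn as nn

class CNNRNN(nn.Module):
    # assumes 64x64 input frames; after the two blocks below, the
    # feature map is 16 channels of 14x14, i.e. 16 * 14 * 14 = 3136 features
    def __init__(self, n_classes=10, hidden_size=100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 8, (1, 5, 5)),          # (B, 8, T, 60, 60)
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2), (1, 2, 2)),  # (B, 8, T, 30, 30)
            nn.Conv3d(8, 16, (1, 3, 3)),         # (B, 16, T, 28, 28)
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2), (1, 2, 2)),  # (B, 16, T, 14, 14)
        )
        self.rnn = nn.RNN(input_size=16 * 14 * 14, hidden_size=hidden_size,
                          num_layers=1, batch_first=True)
        self.fc = nn.Linear(hidden_size, n_classes)

    def forward(self, x):
        # x: (batch, channels, timesteps, height, width)
        out = self.features(x)
        # bring timesteps forward before flattening, so each RNN step
        # sees the features of exactly one frame
        out = out.permute(0, 2, 1, 3, 4).contiguous()
        out = out.view(out.size(0), out.size(1), -1)  # (B, T, 3136)
        out, _ = self.rnn(out)
        return self.fc(out[:, -1, :])  # logits from the last timestep

model = CNNRNN(n_classes=10)
x = torch.randn(2, 1, 5, 64, 64)  # 2 clips, 5 frames of 64x64 each
logits = model(x)
print(logits.shape)  # torch.Size([2, 10])
```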
As a sanity check, I tried to overfit the model for 200 epochs. The loss decreased initially but then plateaued. I want to know whether my use of the 3D convolution layer, and the flattening of the channel and spatial dimensions into features for the RNN layer, is correct.
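The sanity check itself looks roughly like the loop below: fix one small batch and verify the loss can be driven close to zero. The tiny linear model and the random tensors are stand-ins for my actual model and data, and I disable weight_decay here since regularization works directly against memorizing a single batch:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# placeholder model standing in for the CNN+RNN; the loop is what matters
model = nn.Sequential(nn.Flatten(), nn.Linear(4 * 8 * 8, 3))
criterion = nn.CrossEntropyLoss()
# weight_decay omitted on purpose for the overfitting check
optimizer = optim.Adam(model.parameters(), lr=0.01)

x = torch.randn(16, 4, 8, 8)    # one fixed batch, reused every epoch
y = torch.randint(0, 3, (16,))  # fixed labels

for epoch in range(200):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

# the loss should end up near zero if the model can memorize the batch
print(loss.item())
```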