Conv3D with RNN

I want to do classification on a sequence of images. Essentially, I want the local features of each frame, extracted by convolution layers, to feed into an RNN layer. For this I am combining nn.Conv3d() with nn.RNN(). My input has the shape (samples, channels, timesteps, height, width). I don't fully have the intuition behind 3D convolutions yet; since I have to preserve the timesteps dimension, I am using kernel_size=1 along it. So far, I have the following setup:
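To convince myself that a (1, k, k) kernel really leaves the time axis untouched, I checked the shapes on dummy data (the sizes here are made up):

```python
import torch
import torch.nn as nn

# dummy batch: (samples, channels, timesteps, height, width)
x = torch.randn(4, 1, 10, 64, 64)

# kernel depth 1 along the time axis: each frame is convolved independently
conv = nn.Conv3d(1, 8, kernel_size=(1, 5, 5))
out = conv(x)
print(out.shape)  # torch.Size([4, 8, 10, 60, 60]) - timesteps preserved
```

So with kernel depth 1 the 3D convolution behaves like the same 2D convolution applied to every frame, which is exactly what I want before the RNN.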

# input: (batchsize, channels, timesteps, height, width)
layer1 = nn.Sequential(
    nn.Conv3d(1, 8, (1, 5, 5)),
    nn.MaxPool3d((1, 2, 2), (1, 2, 2)),
)
layer2 = nn.Sequential(
    nn.Conv3d(8, 16, (1, 3, 3)),
    nn.MaxPool3d((1, 2, 2), (1, 2, 2)),
)
layer3 = ...
# output: (batchsize, channels', timesteps, height', width')
# features: channels' x height' x width'

# Combine the channels', height' and width' dimensions into the feature input
# for the RNN. The time axis has to be moved next to the batch axis first;
# a direct view() on (batch, channels', timesteps, ...) would mix timesteps.
output = output.permute(0, 2, 1, 3, 4).contiguous()
output = output.view(output.size(0), output.size(1), -1)
n_features = output.size(2)
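A quick shape check of that reshape, with made-up sizes for the conv output:

```python
import torch

# fake conv output: (batch, channels', timesteps, height', width')
out = torch.randn(4, 16, 10, 13, 13)

# move the time axis next to the batch axis before flattening, so each
# timestep keeps only its own channels' x height' x width' values
out = out.permute(0, 2, 1, 3, 4).contiguous()
feats = out.view(out.size(0), out.size(1), -1)
print(feats.shape)  # torch.Size([4, 10, 2704]) = (batch, timesteps, 16*13*13)
```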

# (batchsize, timesteps, features)
# batch_first=True so the RNN consumes (batchsize, timesteps, features);
# hidden_size is whatever width I chose for the RNN state
rnn = nn.RNN(input_size=n_features, hidden_size=hidden_size, batch_first=True)
output, _ = rnn(output)
# Taking the output at the last timestep of the RNN as input for a Linear layer
output = output[:, -1, :]
n_features = output.size(-1)
fc = nn.Linear(n_features, n_classes)
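Putting the pieces together on dummy data to verify the shapes end to end (the 64x64 frames, 10 timesteps, hidden size 64 and 5 classes are placeholders I picked):

```python
import torch
import torch.nn as nn

batch, timesteps, hidden, n_classes = 4, 10, 64, 5
x = torch.randn(batch, 1, timesteps, 64, 64)

layer1 = nn.Sequential(nn.Conv3d(1, 8, (1, 5, 5)),
                       nn.MaxPool3d((1, 2, 2), (1, 2, 2)))
layer2 = nn.Sequential(nn.Conv3d(8, 16, (1, 3, 3)),
                       nn.MaxPool3d((1, 2, 2), (1, 2, 2)))

out = layer2(layer1(x))                       # (4, 16, 10, 14, 14)
out = out.permute(0, 2, 1, 3, 4).contiguous()
out = out.view(batch, timesteps, -1)          # (4, 10, 16*14*14) = (4, 10, 3136)

rnn = nn.RNN(input_size=out.size(2), hidden_size=hidden, batch_first=True)
seq, _ = rnn(out)                             # (4, 10, 64)
logits = nn.Linear(hidden, n_classes)(seq[:, -1, :])
print(logits.shape)  # torch.Size([4, 5])
```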

# loss function
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-6)

As a sanity check, I tried to overfit the model for 200 epochs. The loss went down initially but then plateaued. I want to know whether my use of the 3D convolution layers is correct, and whether flattening the remaining dimensions into features for the RNN layer is the right approach.
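For reference, the overfit check looks roughly like this; the model wrapper, random data, and the 5-epoch count are stand-ins just to show the loop (note CrossEntropyLoss takes raw logits and integer class labels):

```python
import torch
import torch.nn as nn
import torch.optim as optim

class ConvRNN(nn.Module):
    def __init__(self, n_classes=5, hidden=64):
        super().__init__()
        self.layer1 = nn.Sequential(nn.Conv3d(1, 8, (1, 5, 5)),
                                    nn.MaxPool3d((1, 2, 2), (1, 2, 2)))
        self.layer2 = nn.Sequential(nn.Conv3d(8, 16, (1, 3, 3)),
                                    nn.MaxPool3d((1, 2, 2), (1, 2, 2)))
        # 16 * 14 * 14 matches 64x64 input frames after the two conv/pool stages
        self.rnn = nn.RNN(input_size=16 * 14 * 14, hidden_size=hidden,
                          batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):
        out = self.layer2(self.layer1(x))             # (b, 16, t, 14, 14)
        out = out.permute(0, 2, 1, 3, 4).contiguous()
        out = out.view(out.size(0), out.size(1), -1)  # (b, t, features)
        seq, _ = self.rnn(out)
        return self.fc(seq[:, -1, :])                 # logits at last timestep

model = ConvRNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-6)

x = torch.randn(8, 1, 10, 64, 64)   # random stand-in data
y = torch.randint(0, 5, (8,))       # integer class labels for CrossEntropyLoss

for epoch in range(5):              # 200 in the real check
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```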