Image Classification with LSTM and CNN

Hello,
I’m a real beginner with neural networks, so any help is really appreciated :).
Here is my case:
I have a video dataset. From each video I extract one image frame, plus its audio spectrum saved as an image. I have two main folders (one holds the video image frames, the other holds the audio spectrums of each video), and each main folder has 8 subfolders, which are the classes.

My model has two inputs: one image frame and one audio spectrum image. Each input is passed through a pretrained VGG16 in parallel for feature extraction. The outputs of the two branches are then concatenated into an 8192-dimensional vector and passed to the classification step. My problem begins here: I have to use an LSTM for the classification part, but I could not combine VGG16 and the LSTM; maybe it is not possible.

Any ideas?
Thank you,
Best regards

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from torchvision.models import vgg16

vggmodel = vgg16(weights=torchvision.models.VGG16_Weights.DEFAULT)
for param in vggmodel.features.parameters():
    param.requires_grad = False  # note: "require_grad" (without the s) would silently do nothing
class MyModel(nn.Module):
    
    def __init__(self):
        super().__init__()
        m = vggmodel
        for param in m.parameters():
            param.requires_grad = False

        m.classifier[6] = nn.Identity()  # replace the final FC layer so the classifier outputs 4096 features

        self.vgg16_modified = m

        self.rnn = nn.LSTM(
            input_size=8192,
            hidden_size=64,
            num_layers=1,
            batch_first=True)
            
        self.linear = nn.Linear(64, 8)
        
    def forward(self, x):
        y1 = self.vgg16_modified(x["videoFrame"])  # VGG features of the video frame, shape (batch, 4096)
        y2 = self.vgg16_modified(x["audioImage"])  # VGG features of the audio spectrum, shape (batch, 4096)

        y = torch.cat((y1, y2), dim=1)  # concatenate the two 4096-d vectors -> (batch, 8192)

        # with batch_first=True the LSTM expects (batch, seq_len, features);
        # here each sample is a "sequence" of length 1
        r_in = y.view(y.size(0), 1, 8192)
        r_out, _ = self.rnn(r_in)
        r_out2 = self.linear(r_out[:, -1, :])  # take the last timestep's hidden state
        return F.log_softmax(r_out2, dim=1)

model = MyModel()
print(model)
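To sanity-check the concat-then-LSTM shapes without loading any VGG weights, here is a minimal sketch where random tensors stand in for the two 4096-d feature vectors (assuming, as above, that `classifier[6]` is replaced with `nn.Identity()` so each branch outputs 4096 features):

```python
import torch
import torch.nn as nn

batch_size = 4
y1 = torch.randn(batch_size, 4096)  # stand-in for video-frame features
y2 = torch.randn(batch_size, 4096)  # stand-in for audio-spectrum features

y = torch.cat((y1, y2), dim=1)      # (4, 8192)

rnn = nn.LSTM(input_size=8192, hidden_size=64, num_layers=1, batch_first=True)
linear = nn.Linear(64, 8)

r_in = y.view(batch_size, 1, 8192)  # (batch, seq_len=1, features) for batch_first=True
r_out, _ = rnn(r_in)                # (4, 1, 64)
logits = linear(r_out[:, -1, :])    # (4, 8)
print(logits.shape)                 # prints torch.Size([4, 8])
```

If the shapes line up here, the same reshaping should work inside `forward` with the real VGG features.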