Hello,

I’m a real beginner with neural networks, so any help is really appreciated :).

So my case is this:

I have a video dataset. From each video I extract one image frame and one audio spectrogram (saved as an image). I have two main folders: one contains the video frames and the other contains the audio spectrograms of each video. Each main folder has 8 subfolders, which are the classes.

My model has two inputs: one image frame and one audio spectrogram image. Each input is passed through a pretrained VGG16 in parallel for feature extraction. The results of the two branches are then concatenated into an 8192-dimensional vector and passed to the classification step. My problem begins here: I have to use an LSTM for the classification part, but I could not combine VGG and LSTM. Maybe it is not possible?

Any ideas?

Thank you,

Best regards

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from torchvision.models import vgg16

vggmodel = vgg16(weights=torchvision.models.VGG16_Weights.DEFAULT)

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        m = vggmodel
        for param in m.parameters():
            param.requires_grad = False  # freeze all VGG16 weights (note: "requires_grad", not "require_grad")
        m.classifier[6] = nn.Identity()  # replace the final FC layer so each image yields a 4096-dim feature vector
        self.vgg16_modified = m
        self.rnn = nn.LSTM(
            input_size=8192,
            hidden_size=64,
            num_layers=1,
            batch_first=True)
        self.linear = nn.Linear(64, 8)

    def forward(self, x):
        y1 = self.vgg16_modified(x["videoFrame"])  # VGG features for the video frame: (batch, 4096)
        y2 = self.vgg16_modified(x["audioImage"])  # VGG features for the audio spectrogram: (batch, 4096)
        y = torch.cat((y1, y2), dim=1)             # concatenate the two branches: (batch, 8192)
        r_in = y.unsqueeze(1)                      # (batch, seq_len=1, 8192); a batch_first LSTM expects (batch, seq, features)
        r_out, (_, _) = self.rnn(r_in)
        r_out2 = self.linear(r_out[:, -1, :])      # last time step -> 8 classes
        return F.log_softmax(r_out2, dim=1)

model = MyModel()
print(model)
```