I’m a complete beginner with neural networks, so any help is really appreciated :).
Here is my situation:
I have a video dataset. From each video I extract one image frame and one audio spectrum (saved as an image). I have two main folders: one contains the video image frames and the other contains the audio spectrums of each video. Both main folders have the same 8 subfolders, which are the classes.
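To pair the two folder trees, here is a minimal sketch of an index builder (the folder names `frames` and `spectrums` and the matching file names are my assumptions — adjust them to your actual layout):

```python
from pathlib import Path

def build_index(frames_root, spectrums_root):
    """Pair each frame with its audio spectrum via matching relative paths.

    Returns a list of (frame_path, spectrum_path, class_index) tuples.
    Class indices follow the sorted order of the 8 class subfolder names.
    """
    frames_root, spectrums_root = Path(frames_root), Path(spectrums_root)
    classes = sorted(p.name for p in frames_root.iterdir() if p.is_dir())
    class_to_idx = {c: i for i, c in enumerate(classes)}
    index = []
    for frame_path in sorted(frames_root.glob("*/*")):
        rel = frame_path.relative_to(frames_root)      # e.g. classA/video1.png
        spectrum_path = spectrums_root / rel           # same subfolder, same file name
        if spectrum_path.exists():
            index.append((frame_path, spectrum_path, class_to_idx[rel.parts[0]]))
    return index
```

In a `torch.utils.data.Dataset`, `__getitem__` would then load both images from one of these tuples and return a `{"videoFrame": ..., "audioImage": ...}` dict plus the label.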
My model has two inputs: one image frame and one audio spectrum image. Each input is passed through a pretrained VGG16 in parallel for feature extraction. The results of the two branches are then concatenated into a single 8192-dimensional vector and passed on to the classification step. My problem begins here: I have to use an LSTM for the classification part, but I could not combine VGG and LSTM — maybe it is not possible.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from torchvision.models import vgg16

vggmodel = vgg16(weights=torchvision.models.VGG16_Weights.DEFAULT)

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        m = vggmodel
        for param in m.parameters():
            param.requires_grad = False  # freeze VGG (fixed typo: "require_grad" never took effect)
        m.classifier[6] = nn.Identity()  # replace only the final FC layer, so each branch outputs 4096 features
        self.vgg16_modified = m
        self.rnn = nn.LSTM(input_size=8192, hidden_size=64, num_layers=1, batch_first=True)
        self.linear = nn.Linear(64, 8)

    def forward(self, x):
        y1 = self.vgg16_modified(x["videoFrame"])  # VGG features for the video frame, shape (B, 4096)
        y2 = self.vgg16_modified(x["audioImage"])  # VGG features for the audio spectrum, shape (B, 4096)
        y = torch.cat((y1, y2), dim=1)             # concatenate the two 4096-dim results into (B, 8192)
        # batch_first=True expects (batch, seq_len, features); here the sequence length is 1
        r_in = y.view(y.size(0), 1, 8192)
        r_out, (_, _) = self.rnn(r_in)
        r_out2 = self.linear(r_out[:, -1, :])
        return F.log_softmax(r_out2, dim=1)

model = MyModel()
print(model)
```
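Combining VGG and an LSTM is possible; the only non-obvious part is the shape handed to `nn.LSTM`. Here is a small sketch of just the LSTM classification head on dummy 4096-dim features (random tensors standing in for the VGG outputs, no pretrained weights needed):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

rnn = nn.LSTM(input_size=8192, hidden_size=64, num_layers=1, batch_first=True)
linear = nn.Linear(64, 8)

batch_size = 4
y1 = torch.randn(batch_size, 4096)  # stand-in for VGG features of the video frame
y2 = torch.randn(batch_size, 4096)  # stand-in for VGG features of the audio spectrum

y = torch.cat((y1, y2), dim=1)       # (4, 8192)
r_in = y.view(batch_size, 1, 8192)   # (batch, seq_len=1, features) for batch_first=True
r_out, _ = rnn(r_in)                 # (4, 1, 64)
logits = linear(r_out[:, -1, :])     # (4, 8) -- last (and only) time step
log_probs = F.log_softmax(logits, dim=1)
print(log_probs.shape)               # torch.Size([4, 8])
```

With one frame per video the LSTM only ever sees a one-step sequence; if you later extract multiple frames per video, the same head works by reshaping to `(batch, num_frames, 8192)` instead. Since the model returns `log_softmax`, train it with `nn.NLLLoss`.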