I am trying to fine-tune x3d_s on my video dataset of approx. 700 videos. I removed the 6th block of the pretrained x3d and added two 1D CNN layers as well as ReLU, BatchNorm and Dropout layers.
After training for 40 epochs, the results don't seem right. Validation loss is often equal to or lower than training loss, and validation accuracy is higher than training accuracy. Accuracy also does not seem to increase much…
How could I improve this model? Is the problem the size of the dataset?
import torch
import torch.nn as nn


class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Pretrained X3D-S backbone from PyTorchVideo, kept frozen
        self.model = torch.hub.load('facebookresearch/pytorchvideo', 'x3d_s', pretrained=True)
        for param in self.model.parameters():
            param.requires_grad = False
        # Replace only the 6th block (index 5, the classification head);
        # assigning nn.Identity() to the whole `blocks` list would break the
        # backbone's forward pass, which iterates over the blocks
        self.model.blocks[5] = nn.Identity()

        # Custom head: two pointwise 1D convolutions over the time axis
        self.conv1 = nn.Conv1d(in_channels=192 * 8 * 8, out_channels=1024, kernel_size=1)
        self.batch1 = nn.BatchNorm1d(1024)
        self.conv2 = nn.Conv1d(in_channels=1024, out_channels=256, kernel_size=1)
        self.batch2 = nn.BatchNorm1d(256)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.2)
        self.flat = nn.Flatten()
        self.fc1 = nn.Linear(11520, 256)
        self.fc2 = nn.Linear(256, 8)

    def forward(self, x):
        x = self.model(x)                        # (B, 192, T, 8, 8)
        x = x.permute(0, 2, 1, 3, 4)             # (B, T, 192, 8, 8)
        x = x.reshape(x.size(0), x.size(1), -1)  # (B, T, 192*8*8)
        x = x.permute(0, 2, 1)                   # (B, 192*8*8, T)
        x = self.conv1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.batch1(x)
        x = self.conv2(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.batch2(x)
        x = self.flat(x)                         # (B, 256*T)
        x = self.fc1(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x
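The hard-coded sizes in the head (192*8*8 and 11520) only fit one clip length and resolution, so it may help to trace the shapes by hand. The following is a minimal pure-Python sketch, not part of the model: the function name `trace_head_shapes` is hypothetical, the 8x8 spatial size is read off `in_channels = 192*8*8`, and the 45 time steps are an assumption inferred from 11520 / 256.

```python
def trace_head_shapes(batch, channels, frames, height, width):
    """Follow the tensor shape through the custom head, step by step."""
    shapes = {}
    # Backbone output once the 6th block is replaced: (B, C, T, H, W)
    shapes["backbone"] = (batch, channels, frames, height, width)
    # permute(0, 2, 1, 3, 4) -> (B, T, C, H, W)
    shapes["permute_1"] = (batch, frames, channels, height, width)
    # reshape(B, T, -1) -> (B, T, C*H*W)
    features = channels * height * width
    shapes["reshape"] = (batch, frames, features)
    # permute(0, 2, 1) -> (B, C*H*W, T), the layout Conv1d expects
    shapes["permute_2"] = (batch, features, frames)
    # Conv1d with kernel_size=1 changes the channel count only; length stays T
    shapes["conv1"] = (batch, 1024, frames)
    shapes["conv2"] = (batch, 256, frames)
    # Flatten -> (B, 256*T); this must equal fc1's in_features
    shapes["flatten"] = (batch, 256 * frames)
    return shapes


# Assumed input: 45 time steps and an 8x8 feature map (see the note above)
shapes = trace_head_shapes(batch=2, channels=192, frames=45, height=8, width=8)
for name, shape in shapes.items():
    print(f"{name:10s} {shape}")
```

Under these assumptions the flattened size is 256 * 45 = 11520, matching `fc1`. A different number of frames or input resolution would need different values for `conv1`'s `in_channels` and `fc1`'s `in_features`.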