I am trying to train a SqueezeNet with ConvLSTM cells to perform video classification on the 20BN-Jester dataset.
I have never worked with a recurrent model before, so I am a little confused. I am starting my experiments with only the first 100 videos from the dataset, just to see whether the model can overfit these 100 videos, but I can't even get it to do that.
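For reference, one way to set up this kind of overfitting check is to wrap the dataset in `torch.utils.data.Subset`. This is only a minimal sketch: the random `TensorDataset` is a stand-in for the real Jester loader (which is not shown here), and the tensor sizes and batch size are assumptions chosen to keep it small.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Stand-in dataset (hypothetical): 300 tiny random "videos" of shape
# (seq_len=4, 3, 8, 8), each with one of 27 class labels.
full_dataset = TensorDataset(
    torch.randn(300, 4, 3, 8, 8),
    torch.randint(0, 27, (300,)),
)

# Keep only the first 100 samples for the overfitting experiment.
small_dataset = Subset(full_dataset, range(100))
loader = DataLoader(small_dataset, batch_size=10, shuffle=True)

# One pass over the loader should visit exactly 100 samples.
n_seen = sum(labels.shape[0] for _, labels in loader)
print(len(small_dataset), n_seen)
```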
My batches have shape (bs, seq_len, 3, H, W). I reshape them to (bs*seq_len, 3, H, W) and feed them through the SqueezeNet encoder, then reshape to (bs, seq_len, 512, H, W) so I can feed them to the ConvLSTM cell, and finally reshape back to (bs*seq_len, 512, H, W) to feed them through the SqueezeNet classifier.
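The reshaping round trip described above can be checked on its own with a toy tensor. This is a minimal sketch with made-up sizes; it only confirms that merging batch and time with `view` keeps the frames in order and that splitting them again restores the original layout.

```python
import torch

b, s, c, h, w = 2, 4, 3, 8, 8  # toy sizes, chosen only for illustration
clips = torch.arange(b * s * c * h * w, dtype=torch.float32).view(b, s, c, h, w)

# Merge batch and time so a 2D CNN sees every frame as an independent image.
frames = clips.view(b * s, c, h, w)

# Frame t of video i lands at row i * s + t of the merged tensor.
i, t = 1, 2
same_frame = torch.equal(frames[i * s + t], clips[i, t])

# Splitting the leading dimension again restores the (batch, time) layout.
restored = frames.view(b, s, c, h, w)
round_trip_ok = torch.equal(restored, clips)
print(same_frame, round_trip_ok)
```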
The ConvLSTM cell implementation I am using is this one: https://github.com/ndrplz/ConvLSTM_pytorch/blob/master/convlstm.py
Here is the code of my model:
import torch.nn as nn
import torchvision
from convlstm import ConvLSTM  # convlstm.py from the repository linked above


class SqueezeNet(nn.Module):
    def __init__(self, n_classes=27):
        super(SqueezeNet, self).__init__()
        self.n_classes = n_classes
        sq = torchvision.models.squeezenet.squeezenet1_0(pretrained=False)
        self.sq_features = sq.features
        # One ConvLSTM cell with input size 512, output size 512, and 3x3 kernels.
        self.convLSTM = ConvLSTM(512, 512, [(3, 3)], 1, True, True, False)
        # 1x1 convolution to class scores, then global average pooling to 1x1.
        self.sq_classifier = nn.Sequential(
            nn.Conv2d(512, n_classes, kernel_size=(1, 1), stride=(1, 1)),
            nn.AdaptiveAvgPool2d((1, 1)),
        )
        self.init_layers()

    def forward(self, images):
        """
        Forward propagation.

        :param images: images, a tensor of dimensions (batch_size, seq_len, 3, H, W)
        :return: class predictions, a tensor of dimensions (batch_size, seq_len, n_classes)
        """
        b, s, c, h, w = images.shape
        images = images.view(b * s, c, h, w)
        out = self.sq_features(images)  # (b*s, 512, h', w')
        _, _, h, w = out.shape
        # The linked ConvLSTM returns (layer_output_list, last_state_list).
        layer_outputs, _ = self.convLSTM(out.view(b, s, 512, h, w))
        out = layer_outputs[-1].reshape(b * s, 512, h, w)  # (b*s, 512, h', w')
        out = self.sq_classifier(out)  # (b*s, n_classes, 1, 1)
        return out.view(b, s, self.n_classes)

    def init_layers(self):
        for name, param in self.named_parameters():
            if name.endswith(".weight"):
                nn.init.xavier_uniform_(param)
            elif name.endswith(".bias"):
                nn.init.constant_(param, 0.0)
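The shape comment on the classifier line can be checked in isolation. The sketch below is an assumption-laden toy: the 13x13 feature map size is illustrative and not taken from the post, and `conv_only` and `pooled_head` are hypothetical names. It compares a bare 1x1 convolution with the same convolution followed by global average pooling, which is what the final `view(b, s, n_classes)` seems to rely on.

```python
import torch
import torch.nn as nn

n_classes, b, s = 27, 2, 4
features = torch.randn(b * s, 512, 13, 13)  # assumed encoder output size

conv_only = nn.Conv2d(512, n_classes, kernel_size=1)
pooled_head = nn.Sequential(conv_only, nn.AdaptiveAvgPool2d(1))

# A 1x1 convolution changes the channel count but keeps the spatial size.
conv_shape = tuple(conv_only(features).shape)

# Global average pooling collapses the spatial dimensions to 1x1.
pooled_shape = tuple(pooled_head(features).shape)

# Only the pooled output has exactly b * s * n_classes elements,
# so only it can be viewed as (b, s, n_classes).
logits = pooled_head(features).view(b, s, n_classes)
print(conv_shape, pooled_shape, tuple(logits.shape))
```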
Since the videos in the Jester dataset don't all have the same number of frames (35 to 40), I append copies of the last frame to the end of each video so that every video in the dataset has the same length. I am making one prediction per frame, but I have already tried one prediction per video as well, and the loss just doesn't decrease. This leads me to think that I am doing something really wrong, since otherwise the network should be able to overfit such a small dataset (100 videos).
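The padding step described above could look roughly like this. It is a minimal sketch, not the exact loader code: `pad_with_last_frame` is a hypothetical helper name, and the frame size is shrunk to 8x8 to keep the example light.

```python
import torch


def pad_with_last_frame(video, target_len):
    """Pad a (T, C, H, W) video to target_len frames by repeating its last frame."""
    n_missing = target_len - video.shape[0]
    if n_missing < 0:
        raise ValueError("video is longer than target_len")
    if n_missing == 0:
        return video
    # video[-1:] keeps the time dimension, so expand can repeat it n_missing times.
    padding = video[-1:].expand(n_missing, -1, -1, -1)
    return torch.cat([video, padding], dim=0)


video = torch.randn(35, 3, 8, 8)  # a 35-frame clip with toy spatial size
padded = pad_with_last_frame(video, 40)
print(tuple(padded.shape))
```

With per-frame predictions, the repeated frames also receive a label and contribute to the loss, which may or may not be what is wanted.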
Can anyone see what I am doing wrong, or does anyone have a hint?