Video Classification: RNN (GRU) only predicts the last category

Hello there,

I’m currently working on my thesis on video classification. I use a CNN as a feature extractor and use the CNN’s output (its last feature layer) as input to the RNN.
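
For context, the feature-extraction side looks roughly like this (a simplified sketch; I’m assuming a torchvision ResNet-50 here, which yields the 2048-dimensional per-frame features used below, and I’ve left out the actual frame loading and preprocessing):

import torch
import torch.nn as nn
from torchvision import models

# Pretrained ResNet-50 with the final fc layer removed, so every frame
# is mapped to a 2048-d feature vector (matching input_size below).
resnet = models.resnet50(pretrained=True)
feature_extractor = nn.Sequential(*list(resnet.children())[:-1])

def extract_features(frames):
    # frames: (num_frames, 3, 224, 224) tensor of preprocessed video frames
    feats = feature_extractor(frames)     # (num_frames, 2048, 1, 1)
    return feats.view(feats.size(0), -1)  # (num_frames, 2048)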

This is my code for the RNN:

import torch
import torch.nn as nn
from torch.autograd import Variable

use_gpu = torch.cuda.is_available()

num_classes = 2
input_size = 2048
hidden_size = 128
batch_size = 1
num_layers = 2

class Model(nn.Module):

    def __init__(self):
        super(Model, self).__init__()
        self.rnn = nn.GRU(input_size=input_size, num_layers=num_layers, hidden_size=hidden_size, batch_first=True, dropout=1)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, embedded, seq_len):
        hidden = self.init_hidden(seq_len)

        # reshape the frame features to (seq_len, batch_size, input_size)
        embedded = embedded.view(seq_len, batch_size, input_size)
        # propagate input through RNN
        out, hidden = self.rnn(embedded, hidden)
        out = self.fc(out[-1])
        if use_gpu:
            out = out.cuda()

        return out

    def init_hidden(self, seq_len):

        if use_gpu:
            return Variable(torch.zeros(num_layers, seq_len, hidden_size), requires_grad=True).cuda()

        return Variable(torch.zeros(num_layers, seq_len, hidden_size), requires_grad=True)

The code runs fine, but I get poor results.
On the training data it successfully predicts both categories; on the test data, however, it only predicts the last category. Here is the log from the training process:

train Loss: 0.6007287226361679 Acc: 0.7712418300653595 category1 corrects: 157 category2 corrects: 79
test Loss: 1.1236150416044088 Acc: 0.5384615384615384 category1 corrects: 0 category2 corrects: 35
train Loss: 0.7298363817283531 Acc: 0.6209150326797386 category1 corrects: 94 category2 corrects: 96
test Loss: 0.8569914045242163 Acc: 0.5384615384615384 category1 corrects: 0 category2 corrects: 35
train Loss: 0.7280107533522681 Acc: 0.5751633986928104 category1 corrects: 83 category2 corrects: 93
test Loss: 0.765527761899508 Acc: 0.5384615384615384 category1 corrects: 0 category2 corrects: 35
train Loss: 0.7309918669508952 Acc: 0.5228758169934641 category1 corrects: 73 category2 corrects: 87
test Loss: 0.7198472220164079 Acc: 0.5384615384615384 category1 corrects: 0 category2 corrects: 35
train Loss: 0.723503941901369 Acc: 0.48366013071895425 category1 corrects: 69 category2 corrects: 79
test Loss: 0.7001942808811481 Acc: 0.5384615384615384 category1 corrects: 0 category2 corrects: 35

I don’t know why my RNN never predicts category1 correctly on the test set (corrects is always 0).

My dataset is quite small: 150 videos for training and 35 for testing.

I would be happy if someone here could help me solve this issue :slight_smile:

Thanks.

You said your testing set has 35 sequences and your output shows that you detect 35 category2 sequences correctly. Do you only have category2 sequences in your test set?

No, I mean I have 35 videos per category for testing: category1 [35 videos] and category2 [35 videos], so 70 videos in total.

It’s a many-to-one RNN: it accepts many inputs (the video frames) and outputs a single category.

Well, in that case you are definitely overfitting on category2. Do you also have 150 training videos for each class?

Yeah, I have 150 for each category. Any insight on how to prevent overfitting in this case?

You could try applying data augmentation to the input data. Depending on your dataset you could try random crops, brightness changes, color jitter, etc.
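
With torchvision that could look something like this (just a sketch; the crop size and jitter values are placeholders you would tune for your frames):

from torchvision import transforms

# Example augmentation pipeline for the training frames; values are illustrative.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

For video it is usually worth applying the same random parameters to every frame of a clip, so the sequence stays consistent.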

Regularization might also help, e.g. weight decay, dropout, etc.
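
For instance, weight decay can be set directly on the optimizer, and an explicit dropout layer can be added before the classifier (again just a sketch; the values are placeholders):

import torch.optim as optim

model = Model()

# L2 regularization via weight decay; 1e-4 is only a starting point.
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# An extra dropout layer before the classifier, e.g.
#   in __init__:  self.drop = nn.Dropout(p=0.5)
#   in forward(): out = self.fc(self.drop(out[-1]))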

Are you training the whole network from scratch? If so, a pre-trained CNN will probably help, so that you only have to train the RNN.

Yeah, I’m using a pretrained ResNet and finetuning it with my dataset (frames). The problem is in the RNN; I do use dropout=1.

Have you tried to freeze the pre-trained resnet, i.e. not finetuning it?
How similar is your data to ImageNet?

Try first with a frozen ResNet, training only the recurrent layers. Also use augmentations (cropping, color jitter, translation, rotations).
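
Freezing could look like this (a sketch that assumes the feature_extractor and Model from earlier in the thread; only the GRU and the final linear layer end up being trained):

import torch.optim as optim

# Freeze the pretrained backbone so no gradients flow into it.
for param in feature_extractor.parameters():
    param.requires_grad = False
feature_extractor.eval()  # also keeps the batch-norm statistics fixed

model = Model()
# Only the recurrent layers and the classifier are optimized.
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)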