LSTM not learning

Hi,

I have built an encoder-decoder network and am training it on the COCO dataset to perform image captioning. It is an academic exercise to learn how to combine CNNs and RNNs.

The net trains and the loss values show reasonable progress, but I don’t understand why it is only able to predict and memorize 2-4 words. I have compared it with the PyTorch advanced tutorial and, although I am using a fixed caption length instead of variable-length captions and am not using PackedSequence, I cannot get the net to learn.

This is my model file. Please understand that I am new to this field, so I am sure much of my code can be improved; however, I believe the basic operation should be fine.

import torch
import torch.nn as nn
import torchvision.models as models
import numpy as np


class EncoderCNN(nn.Module):
    def __init__(self, embed_size):
        super(EncoderCNN, self).__init__()
        resnet = models.resnet50(pretrained=True)
        # for param in resnet.parameters():
        #     param.requires_grad_(False)
        
        modules = list(resnet.children())[:-1]
        self.resnet = nn.Sequential(*modules)
        self.embed = nn.Linear(resnet.fc.in_features, embed_size)
        self.bn = nn.BatchNorm1d(embed_size, momentum=0.01)

    def forward(self, images):
        features = self.resnet(images)
        features = features.view(features.size(0), -1)
        features = self.embed(features)
        features = self.bn(features)
        return features
    

class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super(DecoderRNN, self).__init__()
        self.embed_size = embed_size
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm1 = nn.LSTM(embed_size, hidden_size, num_layers)
        self.linear = nn.Linear(hidden_size, vocab_size)
        self.n_layers = num_layers
    
    def forward(self, features, captions):
        # hc = self.init_hidden(captions.shape[1])
        hc = None
        captionsembed = self.embed(captions)
        #populating features
        #print(featuresad.shape[0])
        featuresconv = torch.zeros((features.shape[0], captions.shape[1], self.embed_size))

        for i in range(len(featuresconv)):
            for j in range(len(featuresconv[i])):
                if j==0:
                    featuresconv[i,j]=features[i]
                else:
                    featuresconv[i,j]=self.embed(captions[i,j-1])


        # hc = tuple([i for i in hc])
        x, (h,c) = self.lstm1(featuresconv, hc)
        #print('outputforward',x.shape)
        
        batch_size = x.size()[0]
        n_steps = x.size()[1]
        x = x.view(batch_size*n_steps, self.hidden_size)
        #print('stack',x.shape)
        x = self.linear(x)

        

        x = x.view(batch_size, n_steps, -1)


        return x

    def sample(self, features, states=None):
        """Generate captions for given image features using greedy search."""
        sampled_ids = []
        # inputs = features.unsqueeze(1)
        for i in range(30):
            hiddens, states = self.lstm1(features, states)          # hiddens: (batch_size, 1, hidden_size)
            outputs = self.linear(hiddens.squeeze(1))            # outputs:  (batch_size, vocab_size)
            _, predicted = outputs.max(1)                        # predicted: (batch_size)
            sampled_ids.append(predicted)
            features = self.embed(predicted)                       # inputs: (batch_size, embed_size)
            features = features.unsqueeze(1)                         # inputs: (batch_size, 1, embed_size)
        sampled_ids = torch.stack(sampled_ids, 1)                # sampled_ids: (batch_size, max_seq_length)
        return sampled_ids

    
    def init_hidden(self, batch_size):
        ''' Initialize hidden and cell state '''
        # Create two new tensors with sizes n_layers x n_seqs x n_hidden,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        return (weight.new(self.n_layers, batch_size, self.hidden_size).zero_(),
                weight.new(self.n_layers, batch_size, self.hidden_size).zero_())

    

The loss starts at roughly 9.8 and gets down to 2.5, but the net won’t learn any further. However, I have tried running the PyTorch image captioning tutorial model and got it down to the same loss value, but its predictions were far better than those from this model.

Many thanks for any hints in the right direction.

Regards,

Carlos.

The model is learning; that’s why the loss dropped to 2.5. But there will always be a value at which the loss stops decreasing. I think the bad performance is caused by the design of your model.
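
To put the two loss values in context, here is a rough sketch. It assumes the reported loss is the mean per-token cross-entropy in nats (the default behaviour of nn.CrossEntropyLoss) and that the untrained net starts out close to a uniform guess over the vocabulary; neither is confirmed in the thread. Under those assumptions a uniform guess scores ln(V), so a start near 9.8 would point to a vocabulary of roughly 18,000 words, and a plateau at 2.5 corresponds to a perplexity of about 12. The helper names below are made up for illustration.

```python
import math


def uniform_cross_entropy(vocab_size: int) -> float:
    """Cross-entropy (in nats) of a model that gives every word equal probability."""
    return math.log(vocab_size)


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(mean_cross_entropy)


# Loss values reported in the thread.
start_loss, plateau_loss = 9.8, 2.5

# Vocabulary size for which a uniform guess would give the starting loss.
implied_vocab_size = math.exp(start_loss)

print(f"vocabulary size implied by the starting loss: {implied_vocab_size:.0f}")
print(f"perplexity at the plateau: {perplexity(plateau_loss):.1f}")
```

Note that this loss is computed with the ground-truth previous word fed in at every step, so it does not directly measure how good greedy decoding will be. Two models can reach a similar value and still caption very differently.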

Thanks GM, but I would be grateful if you could be more specific…

Carlos.

I’m sorry but I’m not familiar with good models in NLP.
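
Going only by the code as posted, one likely culprit is the layout of the LSTM input. nn.LSTM defaults to batch_first=False and then expects input shaped (seq_len, batch, input_size), while forward() builds a tensor shaped (batch, caption_length, embed_size). The shapes still line up for the loss, but the recurrence then runs across the images in a batch instead of across the words of one caption. The sketch below is a hedged rewrite of the decoder, not a tested fix for the original training run: it sets batch_first=True, builds the input sequence with torch.cat instead of Python loops (which also keeps it on the same device as the features), and restores the unsqueeze in sample(). The max_len argument and the toy sizes in the demo are my own additions.

```python
import torch
import torch.nn as nn


class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        # batch_first=True makes the LSTM read inputs as (batch, seq_len, embed_size).
        self.lstm1 = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Step 0 sees the image feature; step j > 0 sees the embedding of word j-1.
        word_inputs = self.embed(captions[:, :-1])                        # (batch, T-1, embed)
        inputs = torch.cat((features.unsqueeze(1), word_inputs), dim=1)   # (batch, T, embed)
        hiddens, _ = self.lstm1(inputs)                                   # (batch, T, hidden)
        return self.linear(hiddens)                                       # (batch, T, vocab)

    @torch.no_grad()
    def sample(self, features, states=None, max_len=30):
        """Greedy decoding: feed each predicted word back in as the next input."""
        inputs = features.unsqueeze(1)                                    # (batch, 1, embed)
        sampled_ids = []
        for _ in range(max_len):
            hiddens, states = self.lstm1(inputs, states)                  # (batch, 1, hidden)
            outputs = self.linear(hiddens.squeeze(1))                     # (batch, vocab)
            predicted = outputs.argmax(dim=1)                             # (batch,)
            sampled_ids.append(predicted)
            inputs = self.embed(predicted).unsqueeze(1)                   # (batch, 1, embed)
        return torch.stack(sampled_ids, dim=1)                            # (batch, max_len)


# Toy shape check with made-up sizes (not the sizes used in the thread).
torch.manual_seed(0)
decoder = DecoderRNN(embed_size=16, hidden_size=32, vocab_size=50)
features = torch.randn(4, 16)               # stand-in for the encoder output
captions = torch.randint(0, 50, (4, 7))     # fixed-length captions of 7 token ids
logits = decoder(features, captions)
sampled = decoder.sample(features)
print(logits.shape, sampled.shape)
```

If the training loop flattens the output for nn.CrossEntropyLoss, something like logits.reshape(-1, vocab_size) against captions.reshape(-1) should still work, since the output shape is unchanged from the original forward().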