How to get output results from decoder RNN

I have trained an encoder-decoder model for image captioning and need to test it on input test data. Suppose each test input is a feature vector (a torch tensor of size 512), and hidden_size is also 512.

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers, max_seq_length):
        """Set the hyper-parameters and build the layers."""
        super(DecoderRNN, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)
        self.max_seg_length = max_seq_length
    def forward(self, features, captions, lengths):
        """Decode image feature vectors and generates captions."""
        embeddings = self.embed(captions)
        embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
        packed = pack_padded_sequence(embeddings, lengths, batch_first=True) 
        hiddens, _ = self.lstm(packed)
        outputs = self.linear(hiddens[0])
        return outputs
    def sample(self, features, states=None):
        """Generate captions for given image features using greedy search."""
        sampled_ids = []
        inputs = features.unsqueeze(1)
        for i in range(self.max_seg_length):
            hiddens, states = self.lstm(inputs, states)          # hiddens: (batch_size, 1, hidden_size)
            outputs = self.linear(hiddens.squeeze(1))            # outputs:  (batch_size, vocab_size)
            _, predicted = outputs.max(1)                        # predicted: (batch_size)
            sampled_ids.append(predicted)                        # collect this step's token ids
            inputs = self.embed(predicted)                       # inputs: (batch_size, embed_size)
            inputs = inputs.unsqueeze(1)                         # inputs: (batch_size, 1, embed_size)
        sampled_ids = torch.stack(sampled_ids, 1)                # sampled_ids: (batch_size, max_seq_length)
        return sampled_ids

Here is how I feed my input feature to the decoder. Please let me know if you have any suggestions for my approach, because I’m getting almost the same output sentence for every input:

decoder = DecoderRNN(embed_size, hidden_size, len(voc),num_layers,max_seq_length=20).to(device)
img_tensor =img.view(1,512)
img_tensor = img_tensor.to(device)
sampled_ids = decoder.sample(img_tensor)
sampled_ids = sampled_ids[0].cpu().numpy()
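
The snippet above builds a new DecoderRNN but does not show trained weights being loaded or the model being switched to eval mode. If those steps are also missing from the real script, the decoder would run with randomly initialized weights, which could be one explanation for getting nearly the same caption for every image. Below is a minimal sketch of restoring weights before inference. The small nn.LSTM stand-in and the in-memory buffer (used here instead of a real checkpoint file) are assumptions for illustration only.

```python
import io

import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a trained decoder; any nn.Module is saved and restored the same way.
trained = nn.LSTM(input_size=4, hidden_size=4, batch_first=True)

# Save the trained parameters. An in-memory buffer is used here;
# a checkpoint file path would be the usual choice.
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

# At test time: rebuild the same architecture, then load the saved weights.
restored = nn.LSTM(input_size=4, hidden_size=4, batch_first=True)
restored.load_state_dict(torch.load(buffer))
restored.eval()  # disables dropout and similar training-only behavior

x = torch.randn(1, 3, 4)
with torch.no_grad():  # no gradients are needed for inference
    out_trained, _ = trained(x)
    out_restored, _ = restored(x)

# The restored model reproduces the trained model's outputs.
same_output = torch.allclose(out_trained, out_restored)
print(same_output)
```

The same pattern would apply to both the encoder and the decoder: construct, load the state dict, call eval(), and wrap sampling in torch.no_grad().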

Is this the correct way to generate LSTM output from DecoderRNN?
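
For reference, the sampled_ids array produced above holds vocabulary indices, not words, so comparing captions is easier after mapping the indices back to tokens. Here is a small sketch of that step. The idx2word dictionary and the `<start>` / `<end>` token names are hypothetical, since the structure of voc is not shown in the post.

```python
def ids_to_caption(ids, idx2word, start_token="<start>", end_token="<end>"):
    """Map a sequence of token ids to a caption string, stopping at the end token."""
    words = []
    for i in ids:
        word = idx2word[int(i)]
        if word == end_token:
            break  # everything after the end token is ignored
        if word == start_token:
            continue  # the start token is not part of the caption
        words.append(word)
    return " ".join(words)


# Hypothetical toy vocabulary, for illustration only.
idx2word = {0: "<pad>", 1: "<start>", 2: "<end>", 3: "a", 4: "dog", 5: "runs"}

caption = ids_to_caption([1, 3, 4, 5, 2, 2, 2], idx2word)
print(caption)  # a dog runs
```

Printing the decoded caption for several different images makes it easy to see whether the outputs are truly identical or only share a common prefix.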

Without the code of the encoder, and just by looking at the code, it’s difficult to make a good comment.

However, the line

outputs = self.linear(hiddens[0])

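sure seems off. With batch_first=True, the output of an LSTM that is fed a regular padded tensor has shape (batch_size, seq_len, hidden_size) (since you don’t use a bidirectional LSTM), so hiddens[0] would only consider the first sequence in your batch. I don’t think that’s what you want to do, and it’s a bit telling that your network is not throwing an error. One caveat: forward passes a packed sequence to the LSTM, so hiddens is a PackedSequence there, and hiddens[0] is then its flattened data tensor rather than the first batch element. Either way, there might be other issues :).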
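
One way to check what hiddens[0] refers to is to print the shapes for both kinds of LSTM input. The sketch below uses made-up sizes and random data, so it only illustrates the indexing behavior and is not the original training setup.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
batch_size, seq_len, embed_size, hidden_size = 3, 5, 8, 16
lstm = nn.LSTM(embed_size, hidden_size, num_layers=1, batch_first=True)

x = torch.randn(batch_size, seq_len, embed_size)
lengths = [5, 3, 2]  # sorted in decreasing order, as pack_padded_sequence expects by default

# Case 1: a regular padded tensor as input.
out_padded, _ = lstm(x)
padded_shape = tuple(out_padded.shape)        # (batch_size, seq_len, hidden_size)
first_seq_shape = tuple(out_padded[0].shape)  # first sequence only: (seq_len, hidden_size)

# Case 2: a PackedSequence as input, as in forward().
packed = pack_padded_sequence(x, lengths, batch_first=True)
out_packed, _ = lstm(packed)
# PackedSequence is a named tuple, so index 0 is its .data field:
# all valid time steps stacked together, shape (sum(lengths), hidden_size).
packed_data_shape = tuple(out_packed[0].shape)

# Converting back to a padded tensor restores the batch dimension.
unpacked, unpacked_lengths = pad_packed_sequence(out_packed, batch_first=True)
unpacked_shape = tuple(unpacked.shape)

print(padded_shape, first_seq_shape, packed_data_shape, unpacked_shape)
```

In the packed case, applying the linear layer to hiddens[0] scores every non-padded time step in the batch, which lines up with targets packed the same way.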

Suppose the encoder just uses transfer-learned features like VGG, so the encoder outputs 4096 features and reduces them to 512 (see How to concatenate CNN features and reduce size). Then there is nothing new on the encoder side.