Output of Decoder RNN contains an extra step in the second dimension

I have developed an Encoder (CNN)-Decoder (RNN) network for image captioning in PyTorch. The decoder network takes two inputs: the context feature vector from the encoder and the word embeddings of the caption (used during training). The context feature vector has size embed_size, which is also the embedding size of each word in the caption. My question here is mainly about the output of the DecoderRNN class. Please refer to the code below.

import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super(DecoderRNN, self).__init__()
        self.embed_size = embed_size
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        self.num_layers = num_layers
        self.linear = nn.Linear(hidden_size, vocab_size)    # hidden state -> vocabulary scores
        self.embed = nn.Embedding(vocab_size, embed_size)   # word indices -> embeddings
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)

    def forward(self, features, captions):
        embeddings = self.embed(captions)                   # (batch, caption_length, embed_size)
        # prepend the image feature vector as the first step of the sequence
        embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)  # (batch, caption_length + 1, embed_size)
        hiddens, _ = self.lstm(embeddings)                  # (batch, caption_length + 1, hidden_size)
        outputs = self.linear(hiddens)                      # (batch, caption_length + 1, vocab_size)
        return outputs

In the forward function, I pass in a sequence of shape (batch_size, caption_length + 1, embed_size) (the concatenation of the context feature vector and the embedded caption). The output should be the predicted captions with shape (batch_size, caption_length, vocab_size), but I am still receiving an output of shape (batch_size, caption_length + 1, vocab_size). Can anyone please suggest what I should alter in my forward function so that the extra step in the second dimension is not produced? Thanks in advance.
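
For reference, here is a quick shape check that reproduces what I am seeing. The sizes below (embed_size = 256, etc.) are placeholders I picked just for this check; only the shapes matter:

import torch

embed_size, hidden_size, vocab_size = 256, 512, 1000   # placeholder sizes for illustration
batch_size, caption_length = 4, 20

decoder = DecoderRNN(embed_size, hidden_size, vocab_size)
features = torch.randn(batch_size, embed_size)                          # context vector from the encoder
captions = torch.randint(0, vocab_size, (batch_size, caption_length))   # integer word indices

outputs = decoder(features, captions)
print(outputs.shape)  # torch.Size([4, 21, 1000]) -> (batch_size, caption_length + 1, vocab_size)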

For an LSTM, the expected input shape is (seq_len, batch, input_size); in your case it is (batch, seq_len, input_size) because you set batch_first=True. Similarly, the output tensor has shape (batch, seq_len, hidden_size) in your case.

When you pass a tensor in the shape of (batch_size, caption_length+1, embed_size), you are saying the seq_len is caption_length+1. That’s why your output has the shape (batch_size, caption_length+1, vocab_size).

To get an output of the shape (batch_size, caption_length, vocab_size), your input should be of the shape (batch_size, caption_length, embed_size).
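
For example (using the same placeholder sizes as above, and just an LSTM + Linear pair standing in for your decoder), you can see that the LSTM preserves the sequence length, so feeding caption_length steps yields caption_length outputs:

import torch
import torch.nn as nn

embed_size, hidden_size, vocab_size = 256, 512, 1000     # placeholder sizes
batch_size, caption_length = 4, 20

lstm = nn.LSTM(embed_size, hidden_size, num_layers=1, batch_first=True)
linear = nn.Linear(hidden_size, vocab_size)

x = torch.randn(batch_size, caption_length, embed_size)   # (batch, seq_len, input_size)
hiddens, _ = lstm(x)                                       # (batch, seq_len, hidden_size)
outputs = linear(hiddens)
print(outputs.shape)  # torch.Size([4, 20, 1000]) -> (batch_size, caption_length, vocab_size)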


So, can you please propose a solution for how to incorporate the image feature context (from the encoder) along with the caption embeddings when training an RNN for image captioning? That is actually the issue I am facing.