I have developed an Encoder(CNN)-Decoder (RNN) network for image captioning in pytorch. The decoder network takes in two inputs- Context feature vector from the Encoder and the word embeddings of the caption for training. The context feature vector is of size = embed_size , which is also the embedding size of each word in the caption. My question here is more concerned with the output of the Class DecoderRNN. Please refer to the code below.
class DecoderRNN(nn.Module):
def init(self, embed_size, hidden_size, vocab_size, num_layers=1):
super(DecoderRNN, self).init()
self.embed_size = embed_size
self.hidden_size = hidden_size
self.vocab_size = vocab_size
self.num_layers = num_layers
self.linear = nn.Linear(hidden_size, vocab_size)
self.embed = nn.Embedding(vocab_size, embed_size)
self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first = True)
def forward(self, features, captions):
embeddings = self.embed(captions)
embeddings = torch.cat((features.unsqueeze(1), embeddings),1)
hiddens,_ = self.lstm(embeddings)
outputs = self.linear(hiddens)
return outputs
In the forward function , I send in a sequence of (batch_size, caption_length+1, embed_size) (concatenated tensor of context feature vector and the embedded caption) . The output of the sequence should be captions and of the shape (batch_size, caption_length, vocab_size) , but I am still receiving an output of shape (batch_size, caption_length +1, vocab_size) . Can anyone please suggest what should I alter in my forward function so that extra 2nd dimension is not received? Thanks in advance