Image Captioning Example: doubt about the input size of the Decoder LSTM

Hi, I’m new to PyTorch, and I have a doubt about the Image Captioning example code. In the DecoderRNN class the LSTM is defined as:

self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)

In the forward function:

embeddings = self.embed(captions)
embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)

We first embed the captions and then concatenate the embeddings with the context feature from the EncoderCNN. But doesn’t the concatenation increase the size beyond embed_size? How can we forward that to the LSTM, when the input size of the LSTM is already defined as embed_size?

Am I missing something here?

Thanks in advance.

In the code, the batch size is 128 and the embedding size is 256 (each word is represented by a float tensor of size 256).
Let x be the maximum caption length in that batch of 128.
So embeddings = self.embed(captions) gives you a tensor of shape [128, x, 256].
The feature vector output by the encoder has shape [128, 256].
features.unsqueeze(1) makes it [128, 1, 256].
torch.cat along dim 1 then gives you [128, x+1, 256], so the embedding size (the last dimension) is not increased; only the sequence length grows by one, because the image feature is prepended as the first time step of the sequence.
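Here is a minimal shape check that illustrates this, assuming some made-up values that are not from the example (vocabulary size 10000, hidden size 512, max caption length x = 20); it shows that the concatenation only grows the time dimension, while the last dimension stays at embed_size, so the LSTM accepts the result:

import torch
import torch.nn as nn

batch_size, embed_size, hidden_size, num_layers = 128, 256, 512, 1
vocab_size, x = 10000, 20                                # assumed values, for illustration only

embed = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)

captions = torch.randint(0, vocab_size, (batch_size, x)) # dummy caption word indices
features = torch.randn(batch_size, embed_size)           # dummy encoder output

embeddings = embed(captions)                                    # [128, 20, 256]
embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)  # [128, 21, 256]

hiddens, _ = lstm(embeddings)          # last dim is still embed_size, so the LSTM input size matches
print(embeddings.shape)                # torch.Size([128, 21, 256])
print(hiddens.shape)                   # torch.Size([128, 21, 512])

(The real example also packs the padded sequence before the LSTM call, but that doesn’t change the shape argument above.)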

@arijitx Is my answer satisfactory? Did it help you, or do you disagree with this solution?