Image Captioning Example: Doubt about the input size of the Decoder LSTM

Hi, I’m new to PyTorch, and I have a doubt about the Image Captioning example code. In the DecoderRNN class, the LSTM is defined as:

`self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)`

and in the forward function:

```python
embeddings = self.embed(captions)
embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
```

We first embed the captions and then concatenate the embeddings with the context feature from the EncoderCNN. But the concatenation increases the size beyond embed_size, so how can we forward the result to the LSTM, when the LSTM's input size is already defined as embed_size?
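To make the shapes concrete, here is a minimal sketch of the tensors involved, using made-up sizes (embed_size=256, hidden_size=512, etc. are my own assumptions, not values from the example):

```python
import torch
import torch.nn as nn

# Assumed sizes, for illustration only
embed_size, hidden_size, num_layers = 256, 512, 1
batch, seq_len, vocab_size = 4, 10, 1000

embed = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)

features = torch.randn(batch, embed_size)            # stand-in for the EncoderCNN output
captions = torch.randint(0, vocab_size, (batch, seq_len))

embeddings = embed(captions)                         # shape: (4, 10, 256)
# unsqueeze(1) makes features (4, 1, 256); cat along dim 1 (the time dimension)
embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
print(embeddings.shape)                              # torch.Size([4, 11, 256])

out, _ = lstm(embeddings)                            # out: (4, 11, 512)
```

So if I'm reading this right, the concatenation grows the sequence-length dimension (10 → 11), while the last dimension stays embed_size, which is why the LSTM still accepts it. The image feature effectively becomes the first timestep of the input sequence.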

Am I missing something here ?