Concatenating tensors for pack_padded_sequence

I have the following code, in which I want to concatenate captions and features. The captions tensor (after embedding) has size [5, 14, 256] and the features tensor has size [5, 1, 15000] after unsqueezing. The concatenation line returns the following error:

RuntimeError: Sizes of tensors must match except in dimension 2. Got 256 and 15000 (The offending index is 0)

How do I concatenate them properly?

As a reference, I found this link but couldn’t figure out my solution:
Concatenate tensor of 3 dimensions to tensor of 1 dimension while keeping first dimension - PyTorch Forums

	def forward(self, features, captions, lengths):
	    """Decode image feature vectors and generates captions."""
	    embeddings = self.embed(captions)
	    embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
	    packed = pack_padded_sequence(embeddings, lengths, batch_first=True)
	    hiddens, _ = self.lstm(packed)
	    outputs = self.linear(hiddens[0])
	    return outputs

The hidden state is the second return value, together with the cell state. Try:

out, (h, c) = self.lstm(packed)
output = self.linear(h[-1])
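
As a rough illustration of what the LSTM returns for a packed input, here is a minimal sketch. The hidden size of 512 and the `lengths` values are made up for the example and are not taken from the original code.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

# Assumed sizes: hidden_dim is a made-up value, the others match the post.
batch_size, seq_len, embed_dim, hidden_dim = 5, 14, 256, 512

lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1, batch_first=True)
embeddings = torch.randn(batch_size, seq_len, embed_dim)
lengths = [14, 12, 9, 7, 3]  # hypothetical lengths, sorted in descending order

packed = pack_padded_sequence(embeddings, lengths, batch_first=True)

# First return value: a PackedSequence holding the output at every valid time step.
# Second return value: a tuple (hidden state, cell state).
out, (h, c) = lstm(packed)

packed_data_shape = tuple(out.data.shape)  # (sum(lengths), hidden_dim)
last_hidden_shape = tuple(h[-1].shape)     # (batch_size, hidden_dim)
```

Note that `out[0]` on a PackedSequence is the same tensor as `out.data`, i.e. the outputs of all valid time steps, while `h[-1]` is the final hidden state of the last layer for each sequence.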

Thank you for the answer, but the problem isn’t with self.lstm(packed). The problem lies in the concatenation: the following line of code raises the error.

embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)

where, as I mentioned, features.unsqueeze(1) has size [5, 1, 15000] and embeddings has size [5, 14, 256]. I want to concatenate them so that I can pass the result to pack_padded_sequence, followed by the LSTM layer.
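
For context, torch.cat along dimension 1 requires every other dimension to match, which is why 15000 vs. 256 in dimension 2 fails. One possible way around this, sketched below, is to project the features down to the embedding size first. The `project` layer is hypothetical and not part of the original code; whether such a projection makes sense depends on the task.

```python
import torch
import torch.nn as nn

features = torch.randn(5, 15000)      # [batch_size, feature_dim]
embeddings = torch.randn(5, 14, 256)  # [batch_size, seq_len, embed_dim]

# Reproduce the failure: dimension 2 differs (15000 vs. 256).
try:
    torch.cat((features.unsqueeze(1), embeddings), 1)
    cat_failed = False
except RuntimeError:
    cat_failed = True

# Hypothetical fix: map the features to embed_dim before concatenating.
project = nn.Linear(15000, 256)
projected = project(features).unsqueeze(1)        # [5, 1, 256]
combined = torch.cat((projected, embeddings), 1)  # [5, 15, 256]
combined_shape = tuple(combined.shape)
```

The concatenated sequence is one step longer than the captions, so depending on how lengths was computed, it may need to be adjusted before calling pack_padded_sequence.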

Do you understand what I am trying to convey?

To be more specific, I’m copying the full output below:

Features loaded..!
epoch 0, of 4
lengths shape = torch.Size([5]) and targets shape = torch.Size([5, 14])
features shape = torch.Size([5, 15000]), captions shape: torch.Size([5, 14]), lengths = torch.Size([5])
features size: torch.Size([5, 1, 15000]), embeddings: torch.Size([5, 14, 256])
Traceback (most recent call last):
  File "", line 94, in <module>
  File "", line 73, in main
    outputs = decoder(features, captions, lengths)
  File "C:\Users\user02\anaconda3\envs\videocaptioning\lib\site-packages\torch\nn\modules\", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\user02\Documents\GitHub\ExplainedMVS\BaselineImageCaptioningTorch\", line 48, in forward
    embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
RuntimeError: Sizes of tensors must match except in dimension 2. Got 256 and 15000 (The offending index is 0)

Fair enough :)! I just skimmed through your post on my phone :).

Hm, difficult to say. I’m not even sure what you’re trying to do. When you say that the captions tensor shape is [5, 14, 256], I assume that means [batch_size, seq_len, embed_dim], i.e., that’s the tensor after the embedding.

Before the unsqueeze(), your feature tensor has a shape of [5, 15000], again assuming 5 is the batch size. That means you have 15k features per data sample. Why would you want to concatenate them with the embeddings before the LSTM? Where is the notion of a sequence w.r.t. the features?

I would expect you to push the captions through the LSTM, get the last hidden state of shape [batch_size, hidden_dim] (i.e., [5, 256] if the hidden size equals the embedding size) and then concatenate it with the original feature tensor of shape [5, 15000] to get a tensor of shape [5, 15256].
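
A minimal sketch of that idea, assuming an LSTM hidden size of 256 and made-up `lengths` values (neither is given in the original post):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

features = torch.randn(5, 15000)      # [batch_size, feature_dim]
embeddings = torch.randn(5, 14, 256)  # [batch_size, seq_len, embed_dim]
lengths = [14, 12, 9, 7, 3]           # hypothetical lengths, sorted descending

# Assumption: hidden size equals the embedding size (256).
lstm = nn.LSTM(256, 256, batch_first=True)

packed = pack_padded_sequence(embeddings, lengths, batch_first=True)
_, (h, _) = lstm(packed)

# h[-1] is the last layer's final hidden state per sequence: [5, 256].
# Both tensors are 2-D here, so they can be concatenated along dim 1.
combined = torch.cat((h[-1], features), dim=1)
combined_shape = tuple(combined.shape)  # (5, 15256)
```

With a packed input, the final hidden state corresponds to each sequence's last valid time step rather than to a padded position.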

But again, I don’t know the task and what the data is, so I’m only guessing here.