Integrating word embeddings makes the output shape wrong

Hi everyone,
As the title says, when I use word embeddings, my model output becomes [batch_size, sequence_length, output_size] instead of [batch_size, output_size]. Is this behaviour expected from nn.Embedding? How can I obtain the output I want, [batch_size, output_size]?

class FeedForward(nn.Module):
  def __init__(self, output_size, embedding_dim, hidden_size):
    super(FeedForward, self).__init__()
    # embedding_matrix is a pre-trained [vocab_size, embedding_dim] tensor defined elsewhere
    self.embedding = nn.Embedding.from_pretrained(embedding_matrix)
    self.linear_relu_stack = nn.Sequential(
      nn.Linear(in_features=embedding_dim, out_features=hidden_size),
      nn.ReLU(),
      nn.Linear(in_features=hidden_size, out_features=hidden_size),
      nn.ReLU(),
      nn.Linear(in_features=hidden_size, out_features=output_size),
    )

  def forward(self, input):
    emb = self.embedding(input)
    out = self.linear_relu_stack(emb)
    return out

model = FeedForward(OUTPUT_SIZE, EMBEDDINGS_DIM, HIDDEN_SIZE).to(device)

From the docs:

Input: (*), IntTensor or LongTensor of arbitrary shape containing the indices to extract
Output: (*,H), where * is the input shape and H=embedding_dim

Passing an input in the shape of [batch_size, sequence_length] will create an output in the shape [batch_size, sequence_length, embedding_dim] since each “word” will map to a vector in the embedding.
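A minimal sketch of that behavior, with toy sizes chosen purely for illustration:

```python
import torch
import torch.nn as nn

# Toy embedding table: vocabulary of 10 words, 4-dim vectors (illustrative sizes)
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# A batch of 2 sequences, each 5 token indices long
indices = torch.randint(0, 10, (2, 5))  # [batch_size, sequence_length]

# Each index is replaced by its embedding vector, adding a trailing dimension
vectors = embedding(indices)
print(vectors.shape)  # torch.Size([2, 5, 4]) -> [batch_size, sequence_length, embedding_dim]
```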


I understand that. Is there a common technique for fixing this available in PyTorch's nn (an averaging layer or something) to obtain the dimensions I want?

If you want to get rid of the sequence_length you could just call any reduction on this dimension (e.g. .sum(1), .mean(1), .max(1) etc.), or you could use e.g. a linear layer to map the (static) sequence_length to a single value. It depends on your use case and especially on whether the sequence_length is static or not.
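For illustration, each of these reductions collapses dim 1 (the sequence dimension) of a [batch_size, sequence_length, embedding_dim] tensor (toy sizes assumed):

```python
import torch

x = torch.randn(2, 5, 4)  # [batch_size, sequence_length, embedding_dim]

# All of these collapse the sequence dimension (dim=1)
print(x.sum(1).shape)         # torch.Size([2, 4])
print(x.mean(1).shape)        # torch.Size([2, 4])
print(x.max(1).values.shape)  # torch.Size([2, 4]); .max returns (values, indices)
```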


I would recommend mean() to normalize w.r.t. the sequence lengths, in case they are very different.


@ptrblck I am truly sorry for having to bother you again. I have figured out a way to reduce the dimensionality using the methods you gave me. However, I discovered that I have the option of keeping the word embeddings as they are for each word. But if I do this, the output's dimensions will be all wrong for a Feedforward network (as mentioned). How should I approach this?
PS: Yes, my sequence lengths are all the same.

Don’t be sorry for asking questions.

Not necessarily. I understand you want to keep the temporal dimension instead of reducing it. In this case note that e.g. nn.Conv1d layers would accept inputs in [batch_size, channels, seq_len] (unsure if conv layers would be a good fit for your model) and also e.g. nn.Linear layers accept any input in [batch_size, *, in_features] where * denotes additional dimensions. The linear layer will then be applied on each timestep separately.
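To sketch both options with assumed toy sizes: nn.Linear leaves the sequence dimension untouched and transforms only the last dimension, while nn.Conv1d wants the channel (embedding) dimension second, so the input needs a permute first:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 5, 4)  # [batch_size, seq_len, embedding_dim]

# nn.Linear operates on the last dimension only, keeping seq_len intact,
# i.e. the same linear layer is applied to each timestep separately
linear = nn.Linear(in_features=4, out_features=8)
print(linear(x).shape)  # torch.Size([2, 5, 8])

# nn.Conv1d expects [batch_size, channels, seq_len], so permute first
conv = nn.Conv1d(in_channels=4, out_channels=8, kernel_size=3, padding=1)
print(conv(x.permute(0, 2, 1)).shape)  # torch.Size([2, 8, 5])
```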
Would any of this work? I’m also sure @vdw might have good ideas how these inputs are usually handled.


Hm, difficult to reply here :). What exactly is your input data and your task? It’s a bit odd that all your sentences have the same length.

It seems you’re trying to train a text classifier using a basic Feed Forward Network. This architecture cannot capture the order of words, so your sentences will be treated as a Bag-of-Words (BoW). This works alright for, say, classifying news articles into “politics”, “sports”, “entertainment”, etc. In this case aggregating the (pre-trained) word embeddings using, e.g., mean(), is the way to go. A Feed Forward network just does not support sequential data that way.
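A minimal sketch of this BoW approach, assuming toy sizes and a random embedding_matrix standing in for the pre-trained vectors:

```python
import torch
import torch.nn as nn

# Stand-in for pre-trained vectors: [vocab_size, embedding_dim]
embedding_matrix = torch.randn(10, 4)

class BoWClassifier(nn.Module):
    def __init__(self, output_size, embedding_dim, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix)
        self.classifier = nn.Sequential(
            nn.Linear(embedding_dim, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, output_size),
        )

    def forward(self, input):
        emb = self.embedding(input)    # [batch_size, seq_len, embedding_dim]
        pooled = emb.mean(dim=1)       # [batch_size, embedding_dim]: word order is lost here
        return self.classifier(pooled) # [batch_size, output_size]

model = BoWClassifier(output_size=3, embedding_dim=4, hidden_size=8)
out = model(torch.randint(0, 10, (2, 5)))
print(out.shape)  # torch.Size([2, 3])
```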

The obvious alternative would be to consider more advanced architectures such as Recurrent Neural Networks (RNN) or Convolutional Neural Networks (CNN), as @ptrblck already hinted at.


Thanks for the suggestion! I am currently building multiple models for this topic, so I have to stick with this Feedforward one first.