Best way to represent variable number of words as a vector?

I have a dataset of sentences in a non-English language, like:

  1. word1 word2 word3 word62
  2. word5 word1 word2

The length of each sentence is not fixed. I want to represent each sentence as a fixed-size vector and feed it to my model, keeping as much information as possible in the embedding. I don't want to impose a maximum sentence length, because important information might appear at the end.

The only two approaches I can think of so far are:

  1. convert them to one-hot vectors and add them
  2. convert them to word embeddings and then add them

My questions are:

  1. What happens if I use word2vec to embed the words and then add them up to represent the sentence? Is this better than using one-hot vectors and adding those instead?
  2. Is there a better way? What is the best approach to represent a variable-length sentence without losing information, e.g., without imposing a maximum sentence length? (I want all the words in the sentence to affect the embedding.)

In principle, you can use one-hot vectors and add them. This is the most basic Bag-of-Words (BoW) model. If you use word embeddings, though, you have to average them, not just add them. I’ve tried both approaches on a simple dataset for educational purposes, and, not surprisingly, the word embedding approach works a bit better. But there’s never any harm in trying both and seeing.
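Here is a minimal sketch of both variants (the vocabulary and the randomly initialized embedding matrix are just placeholders; in practice the embeddings would come from word2vec, fastText, or similar):

```python
import numpy as np

# Hypothetical vocabulary built from the example sentences.
vocab = {"word1": 0, "word2": 1, "word3": 2, "word5": 3, "word62": 4}
vocab_size, emb_dim = len(vocab), 8

# Placeholder embedding matrix (would normally come from word2vec/fastText).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(vocab_size, emb_dim))

def bow_one_hot(sentence):
    """Sum of one-hot vectors = a word-count vector of size |vocab|."""
    vec = np.zeros(vocab_size)
    for w in sentence.split():
        vec[vocab[w]] += 1.0
    return vec

def avg_embedding(sentence):
    """Average of the word embeddings = a fixed-size dense sentence vector."""
    idxs = [vocab[w] for w in sentence.split()]
    return embeddings[idxs].mean(axis=0)

print(bow_one_hot("word1 word2 word3 word62"))   # length = vocab size
print(avg_embedding("word5 word1 word2"))        # length = embedding dim
```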

The “problem” is that both are BoW approaches, and you lose any information that stems from the order of the words. Therefore, RNNs (e.g., LSTMs and GRUs), CNNs, or Transformers are the more commonly used approaches. With these, you usually have to make sure that all sentences (at least within the same batch) have the same length; just look up “padding” of sequences. In a nutshell, you introduce a special word (index) to fill up shorter sentences. For your case, for example:

  1. word1 word2 word3 word62
  2. word5 word1 word2 word0

0 is commonly used as the index for the padding token.
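A small sketch of what this looks like in PyTorch, assuming you have already mapped words to indices and reserved index 0 for padding:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical index sequences for the two example sentences above.
sent1 = torch.tensor([1, 2, 3, 62])   # word1 word2 word3 word62
sent2 = torch.tensor([5, 1, 2])       # word5 word1 word2

# pad_sequence fills shorter sequences with padding_value (0 here).
batch = pad_sequence([sent1, sent2], batch_first=True, padding_value=0)
print(batch)
# tensor([[ 1,  2,  3, 62],
#         [ 5,  1,  2,  0]])
```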

Thanks for the answer!

So is there any difference between adding the word embeddings vs. averaging them to represent the sentence?

Also, how can I use RNNs to embed a sentence, so that the output of the RNN is the embedding? I didn’t know we could use them for semi-supervised tasks such as this, since I don’t know the correct output for a sentence. Is there any blog post or project that explains how to do this in PyTorch? And is this basically the best method to embed my sentences as vectors?

If you don’t average the word vectors, you skew your sentence embeddings w.r.t. the sentence lengths. Say you have three “sentences”:

  • “dog cat horse”
  • “dog cat horse dog cat horse”
  • “dog cat horse dog cat horse dog cat horse”

If you add the word embeddings, you get three different sentence embeddings. In short, your sentence embeddings scale up with the sentence length. Averaging the word embeddings makes the sentence embedding agnostic to the sentence length: in the example above, the three resulting sentence embeddings are identical, which is what you want.
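You can see the effect with made-up 2-dimensional vectors for “dog”, “cat”, and “horse” (the numbers are purely illustrative):

```python
import numpy as np

emb = {"dog": np.array([1.0, 0.0]),
       "cat": np.array([0.0, 1.0]),
       "horse": np.array([1.0, 1.0])}

sents = ["dog cat horse",
         "dog cat horse dog cat horse",
         "dog cat horse dog cat horse dog cat horse"]

for s in sents:
    vecs = np.stack([emb[w] for w in s.split()])
    print(s)
    print("  sum :", vecs.sum(axis=0))   # grows with the sentence length
    print("  mean:", vecs.mean(axis=0))  # identical for all three sentences
```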

When it comes to RNNs, the sentence embedding essentially stems from the last hidden state. Note that the output conventionally refers to all hidden states of a sequence. In this case, though, I would recommend first getting familiar with RNNs in PyTorch. There are lots of tutorials.
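As a rough sketch of the idea (the dimensions and the GRU are just illustrative choices, not from a specific tutorial), the last hidden state of the RNN can serve as the sentence embedding:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 100, 32, 64

embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

# Batch of 2 padded index sequences (0 = padding), shape (batch, seq_len).
batch = torch.tensor([[1, 2, 3, 62],
                      [5, 1, 2, 0]])

output, h_n = rnn(embedding(batch))
# output: all hidden states, shape (batch, seq_len, hidden_dim)
# h_n:    last hidden state,  shape (1, batch, hidden_dim)
sentence_embeddings = h_n.squeeze(0)   # shape (batch, hidden_dim)
print(sentence_embeddings.shape)       # torch.Size([2, 64])
```

In practice you would also use `pack_padded_sequence` so the padding positions don’t influence the final hidden state, but that is a detail the PyTorch tutorials cover.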