I’ve done a lot of reading and I’ve seen that word2vec or GloVe embeddings are used in sentiment analysis tasks to retain the semantic value of words. But one thing I still don’t understand is: how does a word embedding represent a whole sentence?
For example, for a twitter sentiment analysis project I want to classify a tweet into positive/negative/neutral.
Raw Input: The weather does not look good tonight.
Pre-processed Input before converting to vectors: weather not good tonight
Just assume some random vector for each of the words above, e.g.:
weather: [34, 56…768], and so on.
Now, how do I represent the sentence “weather not good tonight” using those word embeddings? I read somewhere that we should average the word vectors of all the words in the sentence, and that this will do the job.
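To make sure I understand the averaging idea, here is a minimal sketch of what I think it means, using made-up random vectors (the dictionary and the 5-dimensional size are just placeholders; real GloVe vectors are typically 50–300 dimensions):

```python
import numpy as np

# Hypothetical embedding table with made-up random vectors.
rng = np.random.default_rng(0)
embeddings = {w: rng.standard_normal(5) for w in ["weather", "not", "good", "tonight"]}

tokens = "weather not good tonight".split()
vectors = np.stack([embeddings[t] for t in tokens])  # shape: (4, 5), one row per word

# Mean pooling: one fixed-size vector for the whole sentence.
sentence_vector = vectors.mean(axis=0)               # shape: (5,)
print(sentence_vector.shape)
```

Is this the right interpretation of “averaging the word vectors”?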
But then how will it capture the order of the words during training? Also, multiple different inputs can produce the same mean; how will the LSTM handle that?