N-gram vs CBOW in the tutorial

I’m looking at the word embeddings tutorial: http://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html

It says in the exercise section: “The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep learning. It is a model that tries to predict words given the context of a few words before and a few words after the target word. This is distinct from language modeling, since CBOW is not sequential and does not have to be probabilistic.”

What makes the n-gram model “sequential” or “probabilistic”? The only change I made to the n-gram code for this exercise was turning the trigram into a “fivegram”, where the context is now 2+2 words (two before and two after the target), rather than the 2 words before the target word.

Am I misunderstanding something?
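
For concreteness, here is roughly how I build the training pairs now (just a sketch; the names follow the tutorial’s data prep, with a shortened example sentence):

CONTEXT_SIZE = 2
raw_text = "we are about to study the idea of a computational process".split()

data = []
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    # 2 words before and 2 words after the target word
    context = (raw_text[i - CONTEXT_SIZE:i]
               + raw_text[i + 1:i + CONTEXT_SIZE + 1])
    target = raw_text[i]
    data.append((context, target))

# versus the original trigram setup, which only looks at the 2 preceding words
trigrams = [([raw_text[i], raw_text[i + 1]], raw_text[i + 2])
            for i in range(len(raw_text) - 2)]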

Besides the context being doubled, the optimization problem has also changed: the embedded vectors of the context words are now summed, which was not the case in the n-gram model.
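
Concretely, here is a rough sketch of the difference, reusing the shape of the tutorial’s forward methods (variable names are just illustrative):

# n-gram model: the context embeddings are concatenated, so the first linear
# layer sees a (context_size * embedding_dim)-wide input and position matters
embeds = self.embeddings(inputs).view((1, -1))

# CBOW: the context embeddings are summed, so the input stays embedding_dim
# wide no matter how many context words there are, and order is discarded
embeds = self.embeddings(inputs).sum(dim=0).view((1, -1))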

Oh, so we sum input vectors instead of concatenating them? I missed that part. Why would we want to do that?

Take a closer look at the formulation of the problem, which involves log Softmax(A(∑_{w ∈ C} q_w) + b). Intuitively, it’s one way of gathering the contributions of the surrounding words. You may find this useful. Also, spoiler alert for the solution I gave here.
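
If it helps, here is a minimal sketch of a module along those lines (mine, not necessarily identical to the linked solution):

import torch.nn as nn
import torch.nn.functional as F

class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)  # the q_w vectors
        self.linear = nn.Linear(embedding_dim, vocab_size)        # A and b

    def forward(self, context_idxs):
        # context_idxs: LongTensor of context word indices, shape (context_len,)
        summed = self.embedding(context_idxs).sum(dim=0)   # sum of q_w over the context
        return F.log_softmax(self.linear(summed), dim=-1)  # log Softmax(A(...) + b)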

Oh, I think I got it. In the CBOW model, we want to look at the nearby words, but we don’t want to be constrained by any particular order of those words. So it’s both better and worse than the n-gram model: we throw away the order information, but we gain flexibility in the context. Cool!

Oh, I see how you summed the tensors in your solution:

embedding = self.embedding(x).sum(dim=0)

That’s definitely better than how I did it:

vectors_sum = autograd.Variable(torch.Tensor(1, dimensionality).zero_())
for word in inputs:
    vectors_sum += self.embed(word).view(1, -1)  # accumulate one context word at a time
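
In hindsight, the whole loop collapses into a single call, assuming inputs is a LongTensor of the context word indices rather than a list of single-word Variables:

vectors_sum = self.embed(inputs).sum(dim=0).view(1, -1)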