It says in the exercise section: “The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep learning. It is a model that tries to predict words given the context of a few words before and a few words after the target word. This is distinct from language modeling, since CBOW is not sequential and does not have to be probabilistic.”
What makes an n-gram model “sequential” or “probabilistic”? The only change I made to the n-gram code for this exercise was turning the trigram into a “fivegram”, where the context is now 2+2 words (two before and two after the target), rather than 2 words before the target.
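For reference, building the 2+2 context windows can be sketched like this (the sentence and variable names here are illustrative, not from the tutorial):

```python
# Build (context, target) pairs with 2 words before and 2 after the target.
raw_text = "we are about to study the idea of a computational process".split()
CONTEXT_SIZE = 2  # words on each side of the target

data = []
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    context = raw_text[i - CONTEXT_SIZE:i] + raw_text[i + 1:i + 1 + CONTEXT_SIZE]
    target = raw_text[i]
    data.append((context, target))

# First pair: context around "about"
print(data[0])  # (['we', 'are', 'to', 'study'], 'about')
```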
Besides the context being doubled, the optimization problem has also changed: it now includes a sum over the embedded vectors in the context, which wasn’t the case in the n-gram model.
Take a closer look at the formulation of the problem, log Softmax(A(sum_w q_w) + b). Intuitively, it’s one way of gathering the contributions of the surrounding words. You may find this useful. Also, spoiler alert for the solution I gave here.
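To make the formula concrete, here is a minimal PyTorch sketch of that formulation (class and variable names are my own, not the tutorial’s): the embeddings of the context words are summed into a single vector, and a linear layer plays the role of A and b before the log-softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)  # q_w lookup
        self.linear = nn.Linear(embedding_dim, vocab_size)         # A and b

    def forward(self, context_idxs):
        # sum_w q_w : summing makes the model order-invariant
        summed = self.embeddings(context_idxs).sum(dim=0)
        # log Softmax(A (sum_w q_w) + b)
        return F.log_softmax(self.linear(summed), dim=-1)

model = CBOW(vocab_size=10, embedding_dim=4)
context = torch.tensor([1, 3, 5, 7])  # indices of 2 words before + 2 after
log_probs = model(context)            # shape: (vocab_size,)
```

Note that because the context embeddings are summed, permuting `context` gives exactly the same output, which is the order-invariance the exercise is pointing at.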
Oh, I think I got it. In the CBOW model, we want to look at the nearby words, but we don’t want to be constrained by any particular order of those words. So it’s both better and worse than the n-gram model: we throw away order information, but we gain flexibility in the context. Cool!