# Word2vec CBOW mismatch

Hi everyone, this is my first post here, so sorry if I'm missing something.
I've implemented the word2vec algorithm following the C code posted by Mikolov, which can be found here.
This is my implementation of the CBOW algorithm with negative sampling, where:

• u_embs are of dim [N, H], i.e. N target vectors of dimension H to be predicted by the context vectors
• v_embs are of dim [N, C, H], i.e. for each of the N target vectors, I get C context vectors of dimension H
• neg_v are of dim [N, M, H], i.e. for each of the N target vectors, I get M negative vectors of dimension H, where M is the negative sampling size chosen by the user
• pos_u, pos_v, and neg_v contain the word ids of the target, context, and negative examples respectively; pos_v has been padded with 0 so that all context windows have the same length
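For concreteness, here is a quick sketch of the input shapes described above, with made-up sizes (N=4 targets, C=5 context slots, M=3 negatives, H=16 dimensions, and a hypothetical vocabulary size V=100):

```python
import torch

N, C, M, H, V = 4, 5, 3, 16, 100   # V is a hypothetical vocabulary size

pos_u = torch.randint(1, V, (N,))      # target word ids
pos_v = torch.randint(0, V, (N, C))    # context word ids, 0 used as padding
neg_v = torch.randint(1, V, (N, M))    # negative-sample word ids

table = torch.nn.Embedding(V, H)

# Embedding lookup produces the dimensions listed above
assert table(pos_u).shape == (N, H)       # [N, H] target vectors
assert table(pos_v).shape == (N, C, H)    # [N, C, H] context vectors
assert table(neg_v).shape == (N, M, H)    # [N, M, H] negative vectors
```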
```python
import torch
import torch.nn.functional as F
from torch.nn import init


class CBOW(Word2Vec):
    def __init__(self, emb_size, emb_dimension, cbow_mean=True):
        super(CBOW, self).__init__(emb_size, emb_dimension)
        self.cbow_mean = cbow_mean
        init_range = 0.5 / self.emb_dimension
        init.uniform_(self.v_embs.weight.data, -init_range, init_range)
        init.constant_(self.u_embs.weight.data, 0)
        self.v_embs.weight.data[0, :] = 0  # Set padding vector to 0

    def forward(self, pos_u, pos_v, neg_v):
        u_embs = self.u_embs(pos_u)  # u_embs are the "target" vectors
        v_embs = self.v_embs(pos_v)  # v_embs are the "context" vectors

        # Mean of the context vectors, not counting the padding idx (0)
        if self.cbow_mean:
            mean_v_embs = torch.true_divide(
                v_embs.sum(dim=1),
                (pos_v != 0).sum(dim=1, keepdim=True),
            )
        else:
            mean_v_embs = v_embs.sum(dim=1)

        score = torch.mul(u_embs, mean_v_embs)
        score = torch.sum(score, dim=1)
        score = F.logsigmoid(score)

        neg_score = torch.bmm(self.v_embs(neg_v), u_embs.unsqueeze(2))
        neg_score = F.logsigmoid(-1 * neg_score)

        return -1 * (score.sum() + neg_score.sum())
```


Everything works fine, except for the results I get when I evaluate the learned embeddings (I save the `self.v_embs` embedding).
I've also implemented the Skip-Gram algorithm, where the only thing that changes is the line `score = torch.mul(u_embs, mean_v_embs)`, which becomes `score = torch.mul(u_embs, v_embs)` (in Skip-Gram, `u_embs` and `v_embs` have the same dimensions, and there is no mean to compute). Since with Skip-Gram I obtain results similar to those of gensim and Mikolov's implementation, I'm wondering whether the culprit could be the mean computation.
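The padding-aware mean itself can be checked in isolation. A minimal sketch with a toy batch (ids and sizes are made up); note that passing `padding_idx=0` to `nn.Embedding` keeps row 0 at zero during training, whereas zeroing it once at init, as in the posted code, lets it drift under gradient updates:

```python
import torch

# Toy batch of context word ids; 0 is the padding id
pos_v = torch.tensor([[5, 7, 0, 0],
                      [2, 3, 4, 9]])

# padding_idx=0 pins row 0 to zero, even across training updates
emb = torch.nn.Embedding(10, 3, padding_idx=0)

v_embs = emb(pos_v)                               # [N, C, H]
counts = (pos_v != 0).sum(dim=1, keepdim=True)    # real context words per row
mean_v = v_embs.sum(dim=1) / counts               # pad rows are zero, so the sum is unaffected

# Row 0 averages only its two real context vectors
assert torch.allclose(mean_v[0], v_embs[0, :2].mean(dim=0))
```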

Gensim CBOW

| # | Dataset | Pairs | OOV | ρ |
|---|---------|-------|-----|---|
| 1 | EN-SimVerb-3500.txt | 3500 | 255 | 0.1324 |
| 2 | EN-YP-130.txt | 130 | 12 | 0.1754 |
| 3 | EN-RG-65.txt | 65 | 0 | 0.4973 |
| 4 | EN-MEN-TR-3k.txt | 3000 | 13 | 0.5335 |
| 5 | EN-WS-353-REL.txt | 252 | 1 | 0.5849 |
| 6 | EN-SIMLEX-999.txt | 999 | 7 | 0.2567 |
| 7 | EN-MTurk-771.txt | 771 | 2 | 0.5081 |
| 8 | EN-MC-30.txt | 30 | 0 | 0.5343 |
| 9 | EN-RW-STANFORD.txt | 2034 | 1083 | 0.3422 |
| 10 | EN-WS-353-ALL.txt | 353 | 2 | 0.6282 |
| 11 | EN-WS-353-SIM.txt | 203 | 1 | 0.6768 |
| 12 | EN-MTurk-287.txt | 287 | 3 | 0.6159 |
| 13 | EN-VERB-143.txt | 144 | 0 | 0.3538 |

My CBOW

| # | Dataset | Pairs | OOV | ρ |
|---|---------|-------|-----|---|
| 1 | EN-SimVerb-3500.txt | 3500 | 255 | 0.1031 |
| 2 | EN-YP-130.txt | 130 | 12 | 0.1235 |
| 3 | EN-RG-65.txt | 65 | 0 | 0.3562 |
| 4 | EN-MEN-TR-3k.txt | 3000 | 13 | 0.4226 |
| 5 | EN-WS-353-REL.txt | 252 | 1 | 0.4534 |
| 6 | EN-SIMLEX-999.txt | 999 | 7 | 0.2395 |
| 7 | EN-MTurk-771.txt | 771 | 2 | 0.4255 |
| 8 | EN-MC-30.txt | 30 | 0 | 0.5637 |
| 9 | EN-RW-STANFORD.txt | 2034 | 1083 | 0.3147 |
| 10 | EN-WS-353-ALL.txt | 353 | 2 | 0.5190 |
| 11 | EN-WS-353-SIM.txt | 203 | 1 | 0.5775 |
| 12 | EN-MTurk-287.txt | 287 | 3 | 0.5172 |
| 13 | EN-VERB-143.txt | 144 | 0 | 0.3202 |

I know the results are quite similar, but not on all tests, and keep in mind that with Skip-Gram my results differ from gensim's by only 1% to 3%.
Sorry for the long post, and thank you all.
Federico

The difference here might be caused by the implementation of CBOW in gensim.
When the sum of the context words is used as input, the gradient is divided by the number of context words during backpropagation, which differs from what PyTorch's autograd computes.
Check the discussion here: Is this a bug in the CBOW code or my misunderstanding? · Issue #1873 · RaRe-Technologies/gensim · GitHub
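If that gradient scaling is indeed the cause, one way to mimic the C code's behavior in PyTorch (under the assumption, per the linked issue, that word2vec.c propagates the full gradient to each context vector even when the forward pass uses the mean, whereas autograd would scale it by 1/C) is a straight-through trick: compute the mean in the forward pass, but let the backward pass see a plain sum. A minimal sketch with made-up names (`v`, `v_sum`, `C`):

```python
import torch

v = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)  # two context vectors
C = v.shape[0]

v_sum = v.sum(dim=0)
# Forward value equals the mean, but the detached correction term carries
# no gradient, so the backward pass behaves as if we had used the sum
mean_v = v_sum + (v_sum / C - v_sum).detach()

assert torch.allclose(mean_v, torch.tensor([2.0, 3.0]))  # forward: the mean

mean_v.sum().backward()
# Each context vector receives the full gradient (1.0), not 1/C
assert torch.allclose(v.grad, torch.ones(2, 2))
```

The same line can be dropped into the `cbow_mean` branch of the forward pass to test whether matching the C-style gradient closes the gap with gensim.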