I am trying to implement Word2Vec on the wiki8 dataset.

For every word I initialize a 200-dimensional vector, and the vocabulary has 200,000+ words, which means a large number of parameters has to be updated every iteration.

I implemented negative sampling, which means only a few of the words are used in each computation, say 10 words.

During the backward pass, only those 10 words get non-zero gradients, but the other 199,990 words still have requires_grad=True, so they end up with all-zero gradients.

But optimizer.step() is very slow. Is it because the other 199,990 words still update their parameters (which is useless: w = w + 0)?
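Here is a minimal sketch of my setup (the names, the placeholder loss, and the learning rate are just for illustration, not my actual training code):

```python
import torch
import torch.nn as nn

vocab_size, dim = 200_000, 200

# Dense embedding table: every row participates in autograd.
emb = nn.Embedding(vocab_size, dim)
optimizer = torch.optim.SGD(emb.parameters(), lr=0.01)

# One training step that only touches 10 sampled words.
word_ids = torch.randint(0, vocab_size, (10,))
vectors = emb(word_ids)        # shape: (10, 200)
loss = vectors.sum()           # placeholder loss, just to get gradients
loss.backward()

# emb.weight.grad is a DENSE (200000, 200) tensor: only the rows of
# the 10 sampled words are non-zero, yet step() iterates over all rows.
optimizer.step()
```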

Here are my questions:

How do I tell the optimizer to update only the useful parameters?

How do I avoid the useless update?
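I saw nn.Embedding(sparse=True) and torch.optim.SparseAdam in the docs. Would something like the sketch below avoid the dense update (assuming a placeholder loss; I have not verified this is the intended approach)?

```python
import torch
import torch.nn as nn

vocab_size, dim = 200_000, 200

# sparse=True makes backward produce a sparse gradient:
# only the rows of the sampled words get entries.
emb = nn.Embedding(vocab_size, dim, sparse=True)

# SparseAdam supports sparse gradients (plain Adam/AdamW do not);
# SGD without momentum also works with them.
optimizer = torch.optim.SparseAdam(emb.parameters(), lr=0.01)

word_ids = torch.randint(0, vocab_size, (10,))
loss = emb(word_ids).sum()     # placeholder loss
loss.backward()

# emb.weight.grad is now sparse, so step() should only
# touch the rows of the 10 sampled words.
optimizer.step()
```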

Thanks!!