I am trying to implement Word2Vec on the wiki8 dataset.
For every word I initialize a 200-dimensional vector, and I have 200,000+ words, which means a very large number of parameters could be updated on every iteration.
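A minimal sketch of my setup (the names `vocab_size` and `embed_dim` are just placeholders for my actual values):

```python
import torch
import torch.nn as nn

vocab_size = 200_000   # 200,000+ words in wiki8
embed_dim = 200        # one 200-dim vector per word

# input (center word) and output (context word) embedding tables
in_embed = nn.Embedding(vocab_size, embed_dim)
out_embed = nn.Embedding(vocab_size, embed_dim)
```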
I implemented negative sampling, which means each step only uses a handful of words in the computation, say 10 negative samples.
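Roughly, my forward pass looks like this (a sketch using the embedding tables above; `center`, `context`, and `negatives` are index tensors I build from the data):

```python
import torch.nn.functional as F

def forward(center, context, negatives):
    # center: (batch,), context: (batch,), negatives: (batch, 10)
    v = in_embed(center)        # (batch, 200)
    u_pos = out_embed(context)  # (batch, 200)
    u_neg = out_embed(negatives)  # (batch, 10, 200)

    pos_score = torch.sum(v * u_pos, dim=1)                  # (batch,)
    neg_score = torch.bmm(u_neg, v.unsqueeze(2)).squeeze(2)  # (batch, 10)

    # negative-sampling loss: push the positive score up, the negatives down
    loss = -(F.logsigmoid(pos_score).mean()
             + F.logsigmoid(-neg_score).sum(dim=1).mean())
    return loss
```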
When doing backward, only those 10 words took part in the computation, but the other ~199,990 words' embeddings still have requires_grad=True, so the 10 words get a non-zero grad and the other ~199,990 get a zero grad.
But optimizer.step() is very slow. Is it because the other ~199,990 words still go through the parameter update (which is useless: w = w + 0)?
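For reference, this is how I run the update (plain dense SGD; the learning rate is just what I happen to use):

```python
optimizer = torch.optim.SGD(
    list(in_embed.parameters()) + list(out_embed.parameters()), lr=0.025)

loss = forward(center, context, negatives)
optimizer.zero_grad()
loss.backward()   # .grad is dense: a full (200000, 200) tensor, mostly zeros
optimizer.step()  # touches every row, including the ~199,990 zero-grad ones
```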
Here are my questions:
How do I tell the optimizer to update only the parameters that were actually used?
How do I avoid the useless updates?
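I noticed that nn.Embedding has a sparse=True flag. Would something like this be the right direction? (An untested guess on my side; I'm not sure which optimizers accept sparse gradients.)

```python
# sparse=True makes backward produce sparse grads for only the used rows
in_embed = nn.Embedding(vocab_size, embed_dim, sparse=True)
out_embed = nn.Embedding(vocab_size, embed_dim, sparse=True)

# plain SGD accepts sparse grads; Adam doesn't, but SparseAdam does
optimizer = torch.optim.SparseAdam(
    list(in_embed.parameters()) + list(out_embed.parameters()), lr=0.001)
```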
Thanks!!