Word embeddings tutorial: embeddings don't change after training

I’m following the word embedding tutorial from the PyTorch tutorials page here. I compared the values in the nn.Embedding layer before and after training, and they barely changed. I am wondering whether this is due to:

  • the small scale of the problem (small test document, small embedding size)
  • comparing the wrong things
  • insufficient training (the loss is still too high)
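For the comparison itself, I snapshot the weight matrix before training and diff it afterwards, roughly like this (the `nn.Embedding(97, 10)` sizes match my vocabulary; the in-place update below is only a stand-in for the tutorial's training loop):

```python
import torch
import torch.nn as nn

# Embedding layer with the same shape as in my run: 97 words, 10 dims.
emb = nn.Embedding(97, 10)

# Snapshot the weights BEFORE training. The clone() matters: without it,
# "before" is a view that gets updated in place along with the layer.
before = emb.weight.detach().clone()

# ... training would go here; simulate a tiny parameter update instead ...
with torch.no_grad():
    emb.weight -= 0.001 * torch.randn_like(emb.weight)

after = emb.weight.detach().clone()

# Overall size of the change, and the largest single-entry change.
print((after - before).norm().item())
print((after - before).abs().max().item())
```

With small updates like the ones in my run, both numbers come out tiny, which matches what I see in the dumps below.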

The code is pretty much verbatim from the tutorial. I wrote the vectors to disk (in a format similar to the one output by word2vec/fastText). Here is the head of the vectors before and after training:

==> before.vec <==
97 10
held: -0.47635 -0.00538 0.03114 -0.91018 0.82155 0.34278 0.41364 -0.54206 -0.15564 -0.90375
small -1.08468 -0.75569 -1.08190 -1.49247 -0.97938 -1.23464 2.02468 1.95724 0.92244 1.60459
proud -1.24031 1.05229 -0.21792 1.41590 -0.66731 1.00172 0.62765 1.17728 1.71781 0.74608
to 1.35696 -0.93113 1.62786 -0.61536 0.56246 0.59689 0.19493 0.01508 -0.84847 -1.13680
his -0.64310 -0.43424 0.56751 0.56234 1.47800 2.49974 -2.00864 -1.73962 0.92000 0.89800
say, -0.60591 -2.21703 0.80868 0.51814 0.73763 -0.12399 -0.36284 -0.03719 0.75559 1.37157
a 0.81077 1.01812 0.61868 -1.40281 -0.49369 -0.90551 0.00841 -0.11104 0.87211 0.16391
thou -0.96502 -0.49956 -0.20308 1.31124 -1.23211 2.14177 -0.68637 0.40443 -0.26006 0.37999
field, 0.38589 0.86313 -0.10661 -0.19126 2.88197 -0.80228 0.42361 1.36436 -0.52477 1.39875

==> after.vec <==
97 10
held: -0.47579 -0.00538 0.03034 -0.91066 0.82071 0.34304 0.41407 -0.54159 -0.15510 -0.90298
small -1.08346 -0.75622 -1.08146 -1.49210 -0.97906 -1.23362 2.02298 1.95819 0.92355 1.60536
proud -1.24031 1.05193 -0.21596 1.41722 -0.66788 1.00245 0.62989 1.17841 1.71748 0.74604
to 1.35664 -0.93089 1.62765 -0.61374 0.56187 0.59659 0.19331 0.01426 -0.84901 -1.13577
his -0.64278 -0.43336 0.56731 0.56225 1.47831 2.50070 -2.00857 -1.73981 0.91986 0.89775
say, -0.60615 -2.21782 0.80835 0.51809 0.73800 -0.12409 -0.36205 -0.03710 0.75553 1.37224
a 0.81011 1.01831 0.61801 -1.40217 -0.49177 -0.90454 0.00985 -0.11124 0.87107 0.16403
thou -0.96507 -0.49984 -0.20311 1.31315 -1.23123 2.14063 -0.68915 0.40313 -0.26036 0.37985
field, 0.38665 0.86381 -0.10656 -0.19033 2.88151 -0.80251 0.42322 1.36459 -0.52421 1.39849

As you can see, they are almost the same; the differences only show up a few decimal places in. Here are the losses of the 10-epoch run:

[tensor([ 524.3104], device='cuda:0'),
 tensor([ 521.7900], device='cuda:0'),
 tensor([ 519.2901], device='cuda:0'),
 tensor([ 516.8104], device='cuda:0'),
 tensor([ 514.3484], device='cuda:0'),
 tensor([ 511.9043], device='cuda:0'),
 tensor([ 509.4778], device='cuda:0'),
 tensor([ 507.0662], device='cuda:0'),
 tensor([ 504.6693], device='cuda:0'),
 tensor([ 502.2869], device='cuda:0')]

Clearly the loss is high, even though it's going down. So my question is: am I doing something wrong? After training, the weights of the nn.Embedding layer are the word embeddings, right? I didn't include the code because I didn't change anything except writing the vectors out to disk.
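For reference, the only thing I added is a small dump function; a sketch of it below, with a toy `word_to_ix` standing in for the tutorial's vocabulary (names and sizes are illustrative):

```python
import torch
import torch.nn as nn

# Toy vocabulary; in the tutorial this is built from the text.
word_to_ix = {"held:": 0, "small": 1, "proud": 2}
emb = nn.Embedding(len(word_to_ix), 10)

def write_vec(path, embedding, word_to_ix):
    """Write embeddings in the word2vec/fastText text format:
    first line 'vocab_size dim', then one 'word v1 v2 ... vd' row per word."""
    weights = embedding.weight.detach().cpu()
    with open(path, "w") as f:
        f.write(f"{weights.size(0)} {weights.size(1)}\n")
        for word, ix in word_to_ix.items():
            vec = " ".join(f"{v:.5f}" for v in weights[ix].tolist())
            f.write(f"{word} {vec}\n")

write_vec("before.vec", emb, word_to_ix)
```

The `.detach().cpu()` is what makes this work the same way whether the model sits on CPU or GPU.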

I think your loss looks fine: it is decreasing steadily, almost linearly, which suggests there is still a long way to go. If you train a bit longer, the differences might become larger. If I'm not mistaken, a high loss at this stage is expected.

@Konpat_Ta_Preechakul Thanks for the reply.

I went and trained the model for about 2500 epochs. Here are the losses of the last 10 epochs:

[tensor([ 3.6269], device='cuda:0'),
 tensor([ 3.6266], device='cuda:0'),
 tensor([ 3.6263], device='cuda:0'),
 tensor([ 3.6259], device='cuda:0'),
 tensor([ 3.6257], device='cuda:0'),
 tensor([ 3.6254], device='cuda:0'),
 tensor([ 3.6249], device='cuda:0'),
 tensor([ 3.6247], device='cuda:0'),
 tensor([ 3.6243], device='cuda:0'),
 tensor([ 3.6240], device='cuda:0')]

The loss has gone down significantly. How much loss is acceptable? It probably depends on the end application, but I was wondering if there is a yardstick.
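One yardstick I can think of (my own framing, not from the tutorial) is perplexity: since the tutorial's `total_loss` is a negative log-likelihood summed over all trigram examples, dividing by the number of examples and exponentiating gives the model's effective branching factor, which can be compared to a uniform guess over the vocabulary. The example count of 113 below is a hypothetical value; substitute `len(trigrams)` from your own run:

```python
import math

def perplexity(total_nll, n_examples):
    # exp of the average negative log-likelihood per prediction
    return math.exp(total_nll / n_examples)

vocab_size = 97  # from the .vec header

# Baseline: a model that guesses uniformly over the vocabulary has
# perplexity exactly equal to the vocabulary size.
uniform_ppl = float(vocab_size)

n_examples = 113  # hypothetical; use len(trigrams) from the tutorial code
print(perplexity(524.31, n_examples))  # early loss: roughly chance level
print(perplexity(3.624, n_examples))   # final loss: close to 1, near-memorized
```

On a tiny corpus a perplexity near 1 just means the model has memorized its training trigrams, so it is a sanity check rather than a quality measure.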

Here is the head of the vectors:

==> before.vec <==
97 10
shall -1.51200 -0.13722 -0.02999 -0.29068 -0.43199 -0.41440 -1.26230 -0.00776 -0.90202 -0.47192
sum 0.04348 -0.17251 -0.32429 0.59110 0.58860 -0.63346 0.93111 -1.62142 -2.48102 0.89572
all -1.96656 1.74516 0.28631 0.21345 0.67012 1.27416 0.11448 -1.30043 -2.47435 0.11154
When -0.29002 0.21662 -0.99376 -1.33235 -0.72533 1.38633 -0.24751 -0.12052 -1.33563 -0.36146
art 0.77356 -0.97083 1.50570 -0.32661 -0.12820 -0.24594 2.36714 -0.47082 1.05813 -1.52117
succession 0.47500 0.70047 -0.47240 -0.39529 -0.22532 -0.11470 -0.43629 0.97439 1.50639 -0.66178
held: 1.23583 0.42217 0.53700 -0.72943 0.56429 -1.65103 -0.68944 1.54767 0.32436 1.65500
much 0.96233 -1.03391 -1.68684 -0.38755 0.16864 -0.49799 -0.34811 -2.36278 0.82771 1.00832
'This -0.09349 1.23545 0.62555 0.20450 -1.33267 0.14939 -1.50295 -1.37122 -1.14798 1.31911

==> after.vec <==
97 10
shall -1.80875 -0.27009 0.04512 -0.35795 -0.40817 -0.52694 -1.32084 0.03620 -1.05129 -0.51756
sum 0.03460 -0.15234 -0.32818 0.64060 0.65705 -0.63437 1.10220 -1.75300 -2.66330 0.95863
all -2.16454 1.86197 0.36141 0.23959 0.70754 1.38740 0.11906 -1.38467 -2.69791 0.10806
When -0.28593 0.23379 -1.06475 -1.41304 -0.75584 1.46333 -0.28699 -0.10579 -1.39061 -0.34203
art 0.77069 -1.02793 1.58267 -0.30294 -0.12727 -0.23977 2.47910 -0.50910 1.12315 -1.60805
succession 0.57382 0.80211 -0.56150 -0.42444 -0.24887 -0.07652 -0.52180 1.00314 1.64484 -0.75496
held: 1.32437 0.43335 0.53349 -0.77039 0.58206 -1.80131 -0.75251 1.65968 0.32944 1.74611
much 1.02373 -1.11061 -1.75183 -0.41527 0.20750 -0.46269 -0.32175 -2.53160 0.87068 1.03427
'This -0.09142 1.29842 0.65856 0.15368 -1.41159 0.14917 -1.56514 -1.44337 -1.23931 1.35460

There are many more differences here than in the previous post. However, the variations are still small. Is this normal? In my head, the starting embeddings and the final embeddings would be drastically different, but I can only see very small changes.
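To put a number on it, I computed the per-word cosine similarity between the before and after vectors; values near 1.0 mean the direction of the vector barely moved. A sketch using the rows for `shall` copied from the dumps above:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

# Rows for "shall" from before.vec and after.vec above.
before = [-1.51200, -0.13722, -0.02999, -0.29068, -0.43199,
          -0.41440, -1.26230, -0.00776, -0.90202, -0.47192]
after  = [-1.80875, -0.27009,  0.04512, -0.35795, -0.40817,
          -0.52694, -1.32084,  0.03620, -1.05129, -0.51756]

print(cosine(before, after))  # close to 1.0 => small directional change
```

Running this over the whole vocabulary (rather than one word) would show whether the small changes are uniform or concentrated on frequent words.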

I’m not an expert in this, though; that’s about as far as I can go.