I am not sure whether anyone here uses BERT. I do not know how to interpret its output scores, that is, how to turn them into probabilities. I use the PyTorch version by Hugging Face. For the TF version the output probabilities are log(softmax), but that cannot be what I am seeing here: log-probabilities are always negative, while the scores I get are positive numbers, roughly in the 1-13 range.
I use the code from http://mayhewsw.github.io/2019/01/16/can-bert-generate-text/, which is based on https://github.com/huggingface/pytorch-pretrained-BERT
Original sentence: i love apples. there are a lot of fruits in the world that i like, but apples would be my favorite fruit.
Masked sentence: i love apples . there are a lot of fruits in the world that i [MASK] , but apples would be my favorite fruit .
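For reference, here is roughly what that code does. This is a minimal sketch, not the blog's code verbatim; the bert-base-uncased checkpoint and the top-10 extraction are my assumptions about the setup:

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# pre-tokenized input, so we don't depend on tokenizer behavior for [MASK]
tokens = ("[CLS] i love apples . there are a lot of fruits in the world "
          "that i [MASK] , but apples would be my favorite fruit . [SEP]").split()
masked_index = tokens.index('[MASK]')
tokens_tensor = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

predictions = model(tokens_tensor)  # shape: [1, seq_len, vocab_size]

# raw scores for every vocabulary word at the masked position;
# run without torch.no_grad(), hence the grad_fn in the output below
scores, indices = torch.topk(predictions[0, masked_index], 10)
for s, i in zip(scores, indices):
    print(tokenizer.convert_ids_to_tokens([i.item()]), s)
```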
When I run the masked sentence through the PyTorch version of BERT, I get the following scores (these are the numbers I would like to turn into probabilities):
Best predicted word: ['love'] tensor(12.7276, grad_fn=<...>)
Other words along with their scores:
['like'] tensor(10.2872, grad_fn=<...>)
['miss'] tensor(8.8226, grad_fn=<...>)
['know'] tensor(8.5971, grad_fn=<...>)
['am'] tensor(7.9407, grad_fn=<...>)
['hate'] tensor(7.9209, grad_fn=<...>)
['mean'] tensor(7.8873, grad_fn=<...>)
['enjoy'] tensor(7.8813, grad_fn=<...>)
['want'] tensor(7.6885, grad_fn=<...>)
['prefer'] tensor(7.5712, grad_fn=<...>)
I am quite sure this does not mean that the probability of the word 'love' is proportional to 12.7276 and that of 'like' to 10.2872.
I do know that the sum of func(score) over the whole vocabulary equals 1, but I do not know what func is.
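If I had to guess, func is the softmax, p_i = exp(score_i) / sum_j exp(score_j), which would also explain why the TF version reports log(softmax). Is that right? Continuing from the sketch above, this is what I mean:

```python
import torch
import torch.nn.functional as F

# assuming the scores are raw logits, softmax over the vocabulary axis
# turns them into probabilities that sum to 1
probs = F.softmax(predictions[0, masked_index], dim=0)
print(probs.sum())           # ~tensor(1.), up to floating-point error
print(torch.topk(probs, 3))  # probabilities of 'love', 'like', 'miss'
```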
Thanks