I am not sure if anyone here uses BERT. I do not know how to interpret the output scores, that is, how to turn them into probabilities. I use the PyTorch version by Hugging Face. In the TF version the outputs are log(softmax) values, but that does not make sense in this case: a log-probability can never be positive, and the scores I get are positive numbers on the order of 1 to 10.

I use the code from http://mayhewsw.github.io/2019/01/16/can-bert-generate-text/, which is based on https://github.com/huggingface/pytorch-pretrained-BERT

Original sentence: i love apples. there are a lot of fruits in the world that i like, but apples would be my favorite fruit.

Masked sentence: i love apples . there are a lot of fruits in the world that i [MASK] , but apples would be my favorite fruit .

When I run this through the PyTorch version of BERT, I get the following scores where I expected probabilities:

Best predicted word: [‘love’] tensor(12.7276, grad_fn=)

Other words along with their scores:

[‘like’] tensor(10.2872, grad_fn=)

[‘miss’] tensor(8.8226, grad_fn=)

[‘know’] tensor(8.5971, grad_fn=)

[‘am’] tensor(7.9407, grad_fn=)

[‘hate’] tensor(7.9209, grad_fn=)

[‘mean’] tensor(7.8873, grad_fn=)

[‘enjoy’] tensor(7.8813, grad_fn=)

[‘want’] tensor(7.6885, grad_fn=)

[‘prefer’] tensor(7.5712, grad_fn=)

I am quite sure this does not mean that the probability of the word “love” is proportional to 12.7276 and that of the word “like” to 10.2872.

I also know that the sum of func(score) over the whole vocabulary is 1, but I do not know what func is.
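
To make the question concrete, here is a minimal sketch of what I would expect if the scores were unnormalized logits and func were softmax. Both are assumptions I have not confirmed. It uses only the ten scores listed above, so the resulting numbers are normalized over those ten words and not over the whole vocabulary; the real probabilities would be smaller. The helper name `softmax` and the dictionary `top_scores` are mine, not from the linked code.

```python
import math


def softmax(scores):
    """Map raw scores to positive values that sum to 1 (assumed form of func)."""
    m = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Scores copied from the output above. This is only the top 10,
# NOT the full vocabulary, so the result is illustrative only.
top_scores = {
    "love": 12.7276,
    "like": 10.2872,
    "miss": 8.8226,
    "know": 8.5971,
    "am": 7.9407,
    "hate": 7.9209,
    "mean": 7.8873,
    "enjoy": 7.8813,
    "want": 7.6885,
    "prefer": 7.5712,
}

probs = dict(zip(top_scores, softmax(list(top_scores.values()))))

for word, p in probs.items():
    print(f"{word}: {p:.4f}")
```

If this guess is right, the PyTorch equivalent over the full vocabulary would be something like `torch.nn.functional.softmax(predictions[0, masked_index], dim=-1)`, where `predictions` and `masked_index` stand for the model output tensor and the position of [MASK] in my script.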

Thanks