BERT - Interpreting masked word prediction probabilities

I am not sure whether anyone here uses BERT. I do not know how to interpret the output scores, that is, how to turn them into probabilities. I use the PyTorch version by Hugging Face. For the TF version the output probabilities are log(softmax), but that cannot be the case here: log-probabilities are never positive, and the scores I get are positive numbers of around 1-10.

I use code from and it is based on

Original sentence: i love apples. there are a lot of fruits in the world that i like, but apples would be my favorite fruit.
Masked sentence: i love apples . there are a lot of fruits in the world that i [MASK] , but apples would be my favorite fruit .

When I run this through the PyTorch version of BERT, I get the following scores:

Best predicted word: [‘love’] tensor(12.7276, grad_fn=)
Other words along with their probabilities:
[‘like’] tensor(10.2872, grad_fn=)
[‘miss’] tensor(8.8226, grad_fn=)
[‘know’] tensor(8.5971, grad_fn=)
[‘am’] tensor(7.9407, grad_fn=)
[‘hate’] tensor(7.9209, grad_fn=)
[‘mean’] tensor(7.8873, grad_fn=)
[‘enjoy’] tensor(7.8813, grad_fn=)
[‘want’] tensor(7.6885, grad_fn=)
[‘prefer’] tensor(7.5712, grad_fn=)

I am quite sure this does not mean that the probability for the word “love” is proportional to 12.7276 and the probability for the word “like” is proportional to 10.2872.
I also know that the sum of func(score) over the whole vocabulary is 1. But I do not know what the func is.


The model is BertForMaskedLM. On page: it says:
Outputs the masked language modeling logits of shape [batch_size, sequence_length, vocab_size].

I guess this means that the relative probability between two scores a and b should be 10^(a/b).

I am not sure what the word "logit" means.
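
For reference, here is a sketch of the usual reading, assuming these scores are the raw logits that the BertForMaskedLM description mentions. A logit is an unnormalized score, and the func is softmax: exp(score) divided by the sum of exp(score) over the whole vocabulary. Under that assumption the ratio of two probabilities is exp(a - b), not 10^(a/b). The snippet uses only the ten scores listed above, so its softmax values are renormalized over those ten words and are not the true vocabulary-wide probabilities. The ratio does not depend on the rest of the vocabulary. The names `scores` and `softmax` are illustrative and do not come from the library.

```python
import math

# Scores copied from the output above (top ten words only, not the full vocabulary).
scores = {
    "love": 12.7276, "like": 10.2872, "miss": 8.8226, "know": 8.5971,
    "am": 7.9407, "hate": 7.9209, "mean": 7.8873, "enjoy": 7.8813,
    "want": 7.6885, "prefer": 7.5712,
}

def softmax(values):
    """Map raw scores to positive numbers that sum to 1."""
    m = max(values)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

words = list(scores)
# Assumption: renormalized over these ten words only.
probs = dict(zip(words, softmax(list(scores.values()))))

# The ratio of two probabilities depends only on the difference of their logits.
ratio = math.exp(scores["love"] - scores["like"])

for w in words:
    print(f"{w:>7}: {probs[w]:.4f}")
print(f"P(love) / P(like) = {ratio:.2f}")
```

With the tensor returned by the model, `torch.softmax(logits, dim=-1)` applied along the vocabulary dimension should give the probabilities over the full vocabulary.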